Abstract


The Cell BE Architecture connects a Power processor with several “synergistic
processing units” via a high-speed bus, allowing parallel processing on a chip.
Architectural features enabling high speed performance include SIMD, many wide
registers, DMA provisions, and dual-issue instructions. We have developed
extraordinarily high performance implementations of SHA-256, SHA-1 and MD5
for this architecture. We have also developed parallelized implementations of AES
Encryption.


1  Introduction 

The Cell BE is a general purpose microprocessor architecture designed by Sony, Toshiba and
IBM that features a heterogeneous processing architecture and a high-speed interconnect
between the processing elements and the system memory. The Cell Broadband Engine (Cell BE)
is a 3.2 GHz microprocessor available on the Sony PlayStation 3, in the dual-processor
configuration on an IBM QS22 card for the IBM BladeCenter, and as a PCI co-processor card
available from IBM and Mercury Computer Systems.

This  paper  presents  our  results  in  developing  high-speed  implementations  of  SHA-256,  SHA-1 
and  MD5  hashing  algorithms.  Although  the  Cell  has  a  reputation  as  being  hard  to  program, 
we  show  that  exercising  direct  control  over  a  small  number  of  independent  processing  elements 
makes  it  possible  to  achieve  extraordinarily  high  performance  on  real-world  problems. 

In  November  2009  IBM  confirmed  rumors  that  there  would  be  no  further  development  on  the 
existing  Cell  processor  line[7].  Nevertheless,  the  results  of  this  paper  are  important  for  several 
reasons: 

1.  These  results  show  that  architectural  provisions  on  the  Cell  such  as  SIMD  lanes  and 
DMA  can  bring  significant  computational  power  to  real-world  problems  on  commodity 
hardware. 

2. Although the current generation of Cell architecture is being abandoned, IBM will be
including the Cell VLIW cores in upcoming versions of its Power 7 microprocessors
targeted at supercomputers. IBM’s QS22 blades will remain in production for more than
two years. Organizations needing high-performance functions today with the flexibility
of a general purpose CPU and enterprise-class hardware can use our code today.

3. Many of the lessons learned from developing high-speed implementations of these
algorithms on Cell will be applicable to future heterogeneous architectures.

We  present  the  following: 

•  Implementation  strategies  used  to  optimize  hashing. 

•  Timing  results  for  hashing  operations. 

•  Additional  timing  results  for  AES  Encryption. 
1.1 Prior Work

IBM has reported the speed of AES encryption on the Cell BE as one of its standard benchmarks
from the first of its published Cell papers. The code that produces those timing results has never
been publicly released.

Yang and Goodman demonstrated AES running on an AMD HD 2900 XT GPU with speeds
“16 times faster than high end CPUs. [8]”

Harrison and Waldron implemented AES on the GeForce 6600GT and GeForce 7900GT graphic
coprocessor cards and found that the performance of the GPUs was limited by the speed that
data could be transferred to and from the GPU [4].

Bhaskar et al. developed a strategy for efficient Galois Field Arithmetic implementation on
SIMD Architectures [1].

Costigan and Scott developed a version of SSL that used the Cell BE for cryptographic support [3].
2 Implementation

The implementation of our hashing algorithms utilizes the architecture of the Cell BE:

•  SIMD  lanes,  instruction  pipelining,  and  many  registers 

•  Parallel  Processing 

•  SPU  initialization 

•  PPU  task  delegation 

•  Shared  data  structures 

•  DMA 

•  Synchronization 

2.1  SIMD  lanes,  instruction  pipelining,  and  many  registers 

In  implementing  hashing  and  encryption  on  the  Cell  BE,  we  have  exploited  several  distinct 
levels  of  parallelism.  (Here  we  use  a  broad  meaning  of  parallelism:  utilizing  different  hardware 
resources  simultaneously  in  achieving  a  task.) 

1.  SIMD  instructions  can  operate  on  several  parts  of  registers  in  parallel. 

2.  Each  instruction  pipeline  can  perform  (different  stages  of)  several  instructions  at  once. 

3.  The  two  pipelines  in  an  SPU  can  both  execute  instructions  in  parallel. 

4. Multiple SPUs can work on a task simultaneously, in a pipelined fashion and/or
performing the same steps on parallel data.

5.  In  a  dual-processor  configuration,  both  Cells  can  work  in  parallel. 

By  optimizing  all  these  levels  of  parallelism,  our  implementations  of  hashing  and  encryption 
set  new  speed  records  for  this  hardware. 

To control the three lowest levels of parallelism, we wrote functions to hash or encrypt a
bufferful of data in SPU assembly code, and optimized these by hand for speed. Careful scheduling
of instructions minimized data latency, where one instruction waits for the result of a previous
instruction; this involved finding steps in the algorithm that could be done in parallel over a few
clock cycles, for parallelism in a single pipeline (level 2 above).

Another goal was balancing instructions between the two pipelines (level 3), which is
complicated by the fact that the instruction sets for the two pipelines are completely disjoint. However,
sometimes it was possible and advantageous to replace an instruction in one pipeline by a
different one or even several instructions in the other pipeline to effect the same result with better
parallelism. (The instruction scheduling also needed to satisfy memory alignment constraints
to get dual issue: instructions going to both pipelines at the same clock cycle.)

For all the hashing implementations, we fully unrolled the round loop. For encryption (CTR
and ECB modes), we did not unroll the round loop, but partially unrolled the outer block loop,
to encrypt four blocks together. Other techniques for increasing speed include:

• Instead of conditional jumps, we used the SPU’s predicate operations, allowing execution
to progress at full speed without cache misses.

• Dynamic branch hinting allowed cache pre-fetches for looping without penalty.

The  SIMD  parallelism  (level  1  above)  varied  depending  on  the  algorithm.  For  AES,  the  128-bit 
block  size  matches  the  128-bit  vector  register  size  of  the  SPU.  The  SPU  SIMD  instructions 
made  it  possible  to  operate  on  the  128-bit  register  as  a  whole  (ShiftRows,  AddRoundKey),  as 
four  32-bit  words  (MixColumns),  or  as  sixteen  bytes  (SubBytes  table  lookups),  allowing  us  to 
perform  the  AES  round  operations  on  the  whole  block  at  the  same  time.  (We  also  tried  the 
standard  T-table  approach  that  combines  SubBytes,  ShiftRows,  and  MixColumns  into  simple 
table  look-ups,  but  that  was  less  efficient  than  the  SIMD  parallelized  approach.) 

All  three  hashing  algorithms  are  defined  in  terms  of  a  sequence  of  operations  on  32-bit  words, 
which  fits  nicely  with  the  word  size  of  the  SPU.  However,  all  these  operations  correspond 
directly  to  “even”-pipeline  instructions;  none  are  available  directly  in  the  “odd”-pipeline.  While 
this  gives  unbalanced  pipelines,  it  does  mean  the  “odd”  pipeline  is  free  to  move  words  around 
or  combine  them  for  later  SIMD  operations.  Also,  when  hashing  a  single  file  using  SHA-256, 
it  was  possible  to  replace  some  of  the  word  rotations  (“even”  instructions)  with  quad-word 
rotations  (“odd”  instructions),  by  keeping  all  four  words  of  each  register  identical. 
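The quad-word rotation trick can be checked with a small portable C model. This is only a sketch, not SPU code: the type `quad_t` and the helpers `rotl32`, `rotl128_bits`, and `splat` are illustrative names that do not appear in our sources. It shows that when all four 32-bit words of a 128-bit value are identical, rotating the whole value by n bits gives the same words as rotating each word by n bits.

```c
#include <stdint.h>

/* Rotate one 32-bit word left by n bits (n is reduced modulo 32). */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31u;
    return n ? (x << n) | (x >> (32u - n)) : x;
}

/* A 128-bit register modelled as four 32-bit words, w[0] most significant. */
typedef struct { uint32_t w[4]; } quad_t;

/* Rotate the whole 128-bit value left by n bits (n reduced modulo 32):
 * each word receives its low n bits from the top of the word to its right,
 * and the last word wraps around to the first. */
static quad_t rotl128_bits(quad_t q, unsigned n)
{
    quad_t r;
    n &= 31u;
    for (int i = 0; i < 4; i++) {
        uint32_t next = q.w[(i + 1) % 4];
        r.w[i] = n ? (q.w[i] << n) | (next >> (32u - n)) : q.w[i];
    }
    return r;
}

/* Replicate one word into all four positions. */
static quad_t splat(uint32_t x)
{
    quad_t q = {{ x, x, x, x }};
    return q;
}
```

Because every word borrows its incoming bits from an identical neighbour, the quad-word rotate is indistinguishable from four independent word rotates, which is what lets the work move to the other pipeline.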

For hashing a single file, opportunities for SIMD acceleration are not so obvious, due to the
sequential  nature  of  the  algorithms.  The  SHA  algorithms  involve  “message  scheduling”  to 
extend  the  16-word  message  block  to  a  new  word  for  each  round.  There,  SIMD  can  do  most  of 
the  message  schedule  for  four  rounds  at  once;  the  limitation  is  where  results  are  needed  that  are 
less  than  four  rounds  old.  In  the  hashing  itself,  SIMD  was  used  for  certain  additions  done  four 
at  a  time,  and  some  Boolean  functions  two  words  at  a  time  (on  hash  words  two  or  more  rounds 
old). 

A  completely  different  approach  to  using  SIMD  is  to  do  four  hashes  in  parallel,  for  four  different 
files  (messages);  we  call  this  the  4-lane  approach.  (This  use  of  SIMD  precludes  the  others 
mentioned  above.)  When  processing  multiple  files,  this  is  by  far  the  most  efficient  way.  And 
this  obviously  extends  to  hashing  more  files  using  more  SPUs  (level  4  parallelism),  up  to  32 
files  simultaneously  on  the  8  SPUs  of  a  Cell,  for  tremendous  throughput. 
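The 4-lane idea can be illustrated in portable C. The sketch below is an assumption-laden model rather than our SPU assembly: `vec4_t` stands in for a 128-bit register holding one 32-bit word from each of four independent messages, and the step shown is a simplified MD5-style step with the round constant omitted. All names are hypothetical.

```c
#include <stdint.h>

#define LANES 4

/* One 32-bit word from each of four independent messages. */
typedef struct { uint32_t lane[LANES]; } vec4_t;

static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31u;
    return n ? (x << n) | (x >> (32u - n)) : x;
}

/* MD5 round-1 Boolean function F(x, y, z). */
static uint32_t md5_f(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) | (~x & z);
}

/* Simplified MD5-style step for one message (round constant omitted). */
static uint32_t step_scalar(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
                            uint32_t m, unsigned s)
{
    return b + rotl32(a + md5_f(b, c, d) + m, s);
}

/* Lane-wise building blocks: each applies one operation to all four lanes. */
static vec4_t vec4_add(vec4_t x, vec4_t y)
{
    vec4_t r;
    for (int i = 0; i < LANES; i++) r.lane[i] = x.lane[i] + y.lane[i];
    return r;
}

static vec4_t vec4_rotl(vec4_t x, unsigned s)
{
    vec4_t r;
    for (int i = 0; i < LANES; i++) r.lane[i] = rotl32(x.lane[i], s);
    return r;
}

static vec4_t vec4_f(vec4_t x, vec4_t y, vec4_t z)
{
    vec4_t r;
    for (int i = 0; i < LANES; i++)
        r.lane[i] = (x.lane[i] & y.lane[i]) | (~x.lane[i] & z.lane[i]);
    return r;
}

/* The same step applied to four messages at once. */
static vec4_t step_4lane(vec4_t a, vec4_t b, vec4_t c, vec4_t d,
                         vec4_t m, unsigned s)
{
    vec4_t t = vec4_add(vec4_add(a, vec4_f(b, c, d)), m);
    return vec4_add(b, vec4_rotl(t, s));
}
```

On SIMD hardware each lane-wise helper would map to a single vector instruction, so four messages advance for the instruction cost of one.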

If only a single source is to be hashed, as fast as possible, then that task can be pipelined
between two SPUs (also level 4 parallelism). The message schedule can be preprocessed, since it
depends  only  on  the  message  data,  completely  independent  of  the  current  hash  value.  Similarly, 
the  addition  of  round  constants  can  be  included  in  the  preprocessing.  (This  approach  does  not 
apply  to  MD5,  since  it  has  no  message  scheduling,  and  moreover,  we  got  the  rounds  to  run 
in  the  minimum  possible  number  of  clock  cycles.)  Then  the  second  SPU  does  all  the  actual 
hashing,  relieved  of  the  burden  of  the  message  schedule  and  round  constants.  This  pipelining 
increases  throughput  of  SHA-256  by  about  16%. 
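The reason the message schedule can be split off is visible in a plain C rendering of the SHA-256 schedule as defined in the standard [5]: it reads only the 64-byte message block and never touches the running hash value. The sketch below is portable C, not our SPU code; the function name `sha256_schedule` is illustrative, and the addition of round constants that the first SPU could also perform is left out for brevity.

```c
#include <stdint.h>

/* Rotate right; callers only pass n in 1..31. */
static uint32_t rotr32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32u - n));
}

/* SHA-256 "small sigma" functions used by the message schedule. */
static uint32_t small_sigma0(uint32_t x)
{
    return rotr32(x, 7) ^ rotr32(x, 18) ^ (x >> 3);
}

static uint32_t small_sigma1(uint32_t x)
{
    return rotr32(x, 17) ^ rotr32(x, 19) ^ (x >> 10);
}

/* Expand one 64-byte block into the 64-word schedule.
 * No hash state is read, so this can run ahead on a separate processor. */
static void sha256_schedule(const uint8_t block[64], uint32_t w[64])
{
    for (int t = 0; t < 16; t++) {
        /* Message words are big-endian. */
        w[t] = ((uint32_t)block[4 * t]     << 24) |
               ((uint32_t)block[4 * t + 1] << 16) |
               ((uint32_t)block[4 * t + 2] <<  8) |
                (uint32_t)block[4 * t + 3];
    }
    for (int t = 16; t < 64; t++) {
        w[t] = small_sigma1(w[t - 2]) + w[t - 7] +
               small_sigma0(w[t - 15]) + w[t - 16];
    }
}
```

The recurrence also shows the SIMD limit mentioned earlier: `w[t]` needs `w[t - 2]`, so a group of four consecutive words cannot be computed entirely in one vector step.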

The  speed  advantages  of  these  two  different  approaches  to  parallelizing  hashing,  either  using 
two  SPUs  in  hashing  one  file,  or  using  one  SPU  to  hash  four  files  at  once,  are  shown  in  Table  3. 




Figure  1:  Datastreams  are  hashed  independently  on  multiple  SPUs. 


Figure  2:  Four  separate  datastreams  are  hashed  simultaneously  on  one  SPU  using  SIMD  lanes. 


2.2  Parallel  Processing 

Hash calculation is inherently sequential because the computation for each block of data
depends on the result of the block before it. Nevertheless, we are able to achieve
parallelization using the following approaches:

1. Hashing unrelated data streams on separate SPUs as shown in Figure 1. Throughput
scales linearly with the number of SPUs used. Hashing concurrently on all eight SPUs
of a Cell BE improves throughput by a factor of eight.

2. Hashing four unrelated data streams concurrently on one SPU as shown in Figure 2 by
using the four SIMD lanes of the 128-bit registers and instructions of the SPU. We use
two approaches for parallel processing.

3. Hashing one data stream and pipelining block processing across two SPUs as shown
in Figure 3. The first SPU creates message digests from input text while the second
SPU concurrently hashes message digests. Although this approach does not provide the
greatest throughput for multiple streams, it does provide the minimum latency for one
data stream.
Figure 3: Blocks of a datastream are pipelined across two SPUs. In SPU 1, a message digest is produced.
In SPU 2 the message digest is hashed.


2.3  SPU  Initialization 

The PPU starts one or more SPUs. During initialization, the PPU and each SPU share with
each other the effective address (EA) of their shared data structures, enabling data transfers
and synchronization between processors.

Once  initialized,  the  system  is  ready  to  perform  hashing  jobs. 

Initialization  steps  are  as  follows: 

1.  The  PPU  clears  all  synchronization  variables  to  ensure  that  they  do  not  inadvertently 
indicate  a  signal  event. 

2.  The  PPU  starts  SPU(s),  providing  the  EA  of  the  PPU’s  shared  data  space  in  the  argp  value 
passed  to  the  SPU  when  the  SPU  starts  running.  The  SPUs  use  their  assigned  slot  of  this 
shared  space  to  signal  job  completion. 

3.  The  PPU  begins  to  wait  to  obtain  the  data  spaces  of  the  SPUs.  The  PPU  waits  using 
the  32-bit  hardware  mailbox  synchronization  mechanism.  DMA  communication  could 
have  been  used  instead  of  the  mailbox  mechanism,  but  mailbox  is  simpler  to  implement, 
considering  that  the  synchronization  consists  of  a  32-bit  blocking  transfer. 

4.  As  each  SPU  starts,  it  performs  the  following  initialization  sequence: 

(a) It clears all synchronization variables to ensure that they do not inadvertently
indicate a signal event.

(b) It transmits the LS address of its shared data space to the PPU.

(c)  It  becomes  ready  to  accept  jobs. 

For pipelined hashing, which requires two SPUs, the following additional processing is
performed before the SPUs are ready to accept jobs:

1. The two SPUs wait for a second mailbox signal providing the EA of the shared data space
of the other SPU.

2. The PPU calculates the EA of the shared data spaces of the SPUs by adding the LS address
of the shared data to the EA offset of the LS space for the given SPU.

3. The PPU sends the EA shared data space addresses of SPU0 to SPU1 and of SPU1 to
SPU0.
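The address calculation in step 2 amounts to one addition. A minimal sketch follows, assuming the PPU already holds the effective address at which an SPU's local store is mapped; the function name `ls_to_ea` and the constant `LS_SIZE_BYTES` are illustrative. With libspe2 the base address can be obtained from `spe_ls_area_get`, although this text does not record which call was actually used.

```c
#include <stdint.h>

/* Size of an SPU local store on the Cell BE: 256 KB. */
#define LS_SIZE_BYTES (256u * 1024u)

/* Convert a local-store address reported by an SPU into an effective
 * address usable by other processors. ls_base_ea is the EA at which that
 * SPU's local store is mapped. Returns 0 for an out-of-range LS address. */
static uint64_t ls_to_ea(uint64_t ls_base_ea, uint32_t ls_addr)
{
    if (ls_addr >= LS_SIZE_BYTES) {
        return 0;
    }
    return ls_base_ea + (uint64_t)ls_addr;
}
```

The PPU would compute one such EA per SPU and then mail SPU0's result to SPU1 and SPU1's result to SPU0.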
while files do
    read file
    set job for file
    signal job to SPU
    wait for DONE signal from SPU
end while
send DONE signal to SPU

Figure 4: The PPU repeatedly tasks jobs to the SPU and consumes results until all jobs are done, then
the PPU tasks DONE to terminate.

while true do
    wait for job
    break if job == DONE
    process job
    wait for DONE signal from SPU
    send calculated hash back to EA
end while

Figure 5: The SPU repeatedly accepts and processes jobs until the job indicates DONE.


2.4  PPU  Task  Delegation 

The  SPUs  repeatedly  run  job  requests  in  a  loop  until  the  PPU  says  to  stop.  The  PPU  algorithm 
is  presented  in  Figure  4.  The  SPU  algorithm  is  presented  in  Figure  5. 

For single-SPU jobs, the job request consists of the following:

•  job  ID 

•  address  of  the  data 

•  count  of  data  such  as  number  of  blocks  or  number  of  files 

•  address  of  the  returned  hash  or  hashes 

For pipelined jobs, the job request is similar, but SPU0 does not receive the address of the
returned hash or hashes, and SPU1 does not receive the address of the data.
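A job request with the four fields listed above might be laid out as follows. This is a sketch under assumptions: the field names, field widths, the 32-byte padded size, and the helper `job_for_pipeline_stage` are all hypothetical, chosen only so that the structure is a whole number of 16-byte vectors. The alignment attribute is the GCC extension also used elsewhere in our code.

```c
#include <stdint.h>

/* Hypothetical job request: four fields padded to 32 bytes (two vectors). */
typedef struct job_request_t {
    uint32_t job_id;   /* which hash job to run */
    uint32_t count;    /* number of blocks or number of files */
    uint64_t data_ea;  /* effective address of the input data */
    uint64_t hash_ea;  /* effective address for the returned hash(es) */
    uint32_t pad[2];   /* fill out to a multiple of 16 bytes */
} __attribute__((aligned(16))) job_request_t;

/* For pipelined jobs, stage 0 (SPU0) is not given the hash address and
 * stage 1 (SPU1) is not given the data address. */
static job_request_t job_for_pipeline_stage(job_request_t job, int stage)
{
    if (stage == 0) {
        job.hash_ea = 0;
    } else {
        job.data_ea = 0;
    }
    return job;
}
```

Withholding the unused address from each stage makes it harder for a stage to touch memory it has no reason to access.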

2.5  Shared  Data  Structures 

The PPU and each SPU contain shared data structures. Shared data structures are spaces where
other processes may read or write data. We use these shared data structures in two ways:

• For synchronization. A processor signals another processor via a DMA Put operation.
For example, in pipelined hashing, SPU0 signals to SPU1 that data is available by
performing a DMA Put into SPU1’s shared data space.

• For data transfer between SPUs. An SPU performs a DMA Get to obtain data from the
shared space of another SPU.

For example, in pipelined hashing, SPU1 issues a DMA Get to pull Schedule Data from
the shared data space of SPU0.
Input data and Hash values are not part of shared data space. These values are provided at
runtime during Task Delegation.

2.6  DMA 

Data  flows  between  SPUs  and  global  memory  via  DMA  Get  and  DMA  Put  operations. 

We  use  DMA  for  signaling  and  for  data  transfer. 

DMA is a valuable hardware resource because it is performed by the MMU asynchronously of
the processor. We have the SPU issue a DMA Get, then perform some compute-intensive work.
By the time the compute-intensive work is done, the DMA Get should be complete.

An  SPU  DMAs  with  a  second  SPU  by  specifying  the  LS  address  of  its  own  data  and  the  EA  of 
the  shared  data  space  on  the  second  SPU. 

An  SPU  DMAs  with  global  memory  by  specifying  the  LS  address  of  the  SPU  and  the  EA  of 
the  global  memory. 

The  PPU  DMAs  with  an  SPU  by  placing  the  LS  address  of  the  SPU  and  the  EA  of  the  global 
memory  onto  the  SPU’s  MMU. 

The  DMA  bus  width  is  16  bytes  (128  bits),  aligned.  Although  DMA  supports  transfers  of  less 
than  16  bytes,  we  never  issue  them  because  they  introduce  complexity  and  offer  no  gain.  When 
performing  a  transfer  of  less  than  16  bytes,  the  transfer  size  must  be  a  power  of  two  and  the 
source  and  destination  bytes  must  be  aligned.  Also,  multiple  signal  channels  must  not  be  used 
within  the  same  Vector. 
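The size and alignment rules can be written as a predicate. The following is a sketch, not a hardware definition: the function name `dma_transfer_ok` is illustrative, and two details are assumptions taken from general Cell BE documentation rather than from this text, namely the 16 KB maximum for a single transfer and the requirement that a small transfer use the same offset within a 16-byte line on both sides.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed maximum size of a single DMA transfer. */
#define DMA_MAX_BYTES 16384u

/* Return true if a transfer of `size` bytes between local-store address
 * ls_addr and effective address ea satisfies the rules described above. */
static bool dma_transfer_ok(uint32_t ls_addr, uint64_t ea, size_t size)
{
    if (size == 0 || size > DMA_MAX_BYTES) {
        return false;
    }
    if (size >= 16) {
        /* Full-vector transfers: multiple of 16 bytes, 16-byte aligned. */
        return size % 16 == 0 && ls_addr % 16 == 0 && ea % 16 == 0;
    }
    /* Small transfers: size must be a power of two (1, 2, 4 or 8). */
    if ((size & (size - 1)) != 0) {
        return false;
    }
    /* Naturally aligned, with matching offsets within the 16-byte line
     * (the matching-offset rule is an assumption, see lead-in). */
    return ls_addr % size == 0 && ea % size == 0 &&
           (ls_addr % 16) == (uint32_t)(ea % 16);
}
```

Restricting all transfers to whole 16-byte vectors, as we do, reduces the check to its first branch.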

The DMA Get operation is faster than the DMA Put operation. Also, tests have shown large
intermittent pauses (perhaps 1,000 SPU clock ticks) on large DMA Put operations (over 2 KBytes
and most evident on 16 KBytes). For these reasons and for the simple clarity of sticking to a
“pull-only” data model, we always DMA Get data. Recall that for signaling, we always DMA
Put 1 Vector of data.

DMA operations support Barriers and Tag Groups. A Barrier ensures that prior DMA requests
that are a member of a tag group finish before future DMA requests that are a member of the
same tag group begin. Used together, Barriers and tag groups permit synchronized DMA flow.

2.7  Process  Synchronization 

Synchronization between SPUs and the PPU is performed using DMA Put operations of 128-bit
Vectors. Signaling can be boolean or can contain some additional information. For signaling
availability of a hash job, the PPU sends job information along with a job ID signal when issuing
hash jobs. For example, to dispatch a job to hash ten blocks of data via SHA256, the PPU signals
the SHA256 hash job ID, the EA of the start of the data, ten as the number of blocks, and the
EA of where to place the computed hash value. For signaling a boolean indicator that data is
ready, the SPU sends SET.

index i = 0
if more data at index 0 then
    start reading into buffer 0 tag group 0
    while more data at index i do
        if more data at index i+1 then
            start reading into buffer (i+1)%2 tag group (i+1)%2
        end if
        wait for DMA tag group index i%2
        process buffer index i%2
        index i++
    end while
    send calculated hash to PPU
end if
send DONE signal to SPU

Figure 6: Double-buffering is used so that the MMU is reading the next buffer while the current buffer
is being processed.

2.8  Dataflow  Synchronization 

When  synchronization  signaling  is  predicated  on  completion  of  DMA,  synchronization  actions 
may  be  simplified  using  the  DMA  Barrier  feature.  Specifically,  a  signal  request  is  placed  on  the 
DMA  queue  immediately  after  a  data  transfer  request  is  enqueued.  Both  requests  are  enqueued 
in  the  same  tag  group  and  are  separated  by  a  Barrier  so  that  the  signal  goes  out  immediately 
after the data transfer has completed. For example, when an SPU has completed calculating a
Hash value, the SPU enqueues to DMA the Hash value and then the signal, separated by a Barrier.

We also perform dataflow synchronization by blocking on DMA completion when
double-buffering. Specifically, we issue a DMA Get request to read buffer n+1 then begin processing
buffer n. To be sure that buffer n is in, we block for buffer n. Unless there is an unexpected pause
in processing, the SPU never blocks. This is because the time required for DMA is shorter than
the time required for processing. We block on buffer n and not on buffer n+1 by using separate
tag groups for each buffer.

2.9  Double  Buffering 

We use double-buffering whenever we use DMA Get to obtain data to be processed. If more
data is available, we start reading buffer n+1 before processing buffer n. This flow is shown in
Figure 6.
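The control flow of Figure 6 can be exercised on any machine with a small simulation. The sketch below is not SPU code: `fake_dma_get` is a hypothetical stand-in that copies synchronously where real code would enqueue an asynchronous DMA Get on a tag group, `process` is a byte sum standing in for hashing, and the 4 KB buffer size is an arbitrary choice.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUF_BYTES 4096u

/* Stand-in for a DMA Get of chunk `index`: copies up to BUF_BYTES and
 * returns the number of bytes fetched (0 when there is no more data). */
static size_t fake_dma_get(uint8_t *dst, const uint8_t *src,
                           size_t total, size_t index)
{
    size_t off = index * BUF_BYTES;
    if (off >= total) {
        return 0;
    }
    size_t n = (total - off < BUF_BYTES) ? total - off : BUF_BYTES;
    memcpy(dst, src + off, n);
    return n;
}

/* Stand-in for hashing one buffer: accumulate a byte sum. */
static uint32_t process(uint32_t acc, const uint8_t *buf, size_t n)
{
    for (size_t k = 0; k < n; k++) {
        acc += buf[k];
    }
    return acc;
}

/* Double-buffered loop: request buffer i+1, then process buffer i. */
static uint32_t double_buffered_sum(const uint8_t *src, size_t total)
{
    static uint8_t buf[2][BUF_BYTES];
    size_t len[2] = { 0, 0 };
    uint32_t acc = 0;
    size_t i = 0;

    len[0] = fake_dma_get(buf[0], src, total, 0);
    while (len[i % 2] != 0) {
        /* Start reading the next buffer before touching the current one. */
        len[(i + 1) % 2] = fake_dma_get(buf[(i + 1) % 2], src, total, i + 1);
        /* Real code would wait on tag group i % 2 at this point. */
        acc = process(acc, buf[i % 2], len[i % 2]);
        i++;
    }
    return acc;
}
```

Using the buffer index as the tag group, as in Figure 6, is what allows the wait to target buffer i alone while the fetch of buffer i+1 is still in flight.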
Hash Algorithm    CellBE    2.27 GHz Xeon
SHA256            205.6     101.8
MD5               317.7     371.3
SHA1              413.3     129.8

Table 1: Measured Best Hash Throughput for a Single File in MBytes/Sec


Hash Algorithm    CellBE      4 Core 2.27 GHz Xeon
SHA256            3,398.4     407.2
MD5               10,032.0    1,485.2
SHA1              7,296.0     519.2

Table 2: Aggregate Throughput per chip in MBytes/Sec.


3  Timing  Results 

Timing results are compared between the CellBE and the Xeon Processors on the IBM blades
that we have.

Here “Aggregate Throughput” is the rate at which bytes can be hashed when multiple files are
hashed at one time. For example, in the “SHA256” case, a single CellBE can hash 32
different files at one time.

The file used for all measurements was approx. 262 MBytes in length. The programs used on
the CellBE were built to run on the Cell’s Synergistic Processing Elements. Different programs
on the Cell were used to obtain the “Best Throughput for a Single File” and the “Aggregate
Throughput”.

The Xeon measurements use the “user time” of the “sha256sum”, “md5sum” and “sha1sum”
programs. We did not write special programs. Various people have reported rates of 15-27
cycles/byte for sha256. This is consistent with our speed measurements.
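As a rough cross-check of the cycles-per-byte figure, the single-file SHA256 rates in Table 1 can be converted directly. The helper below is a sketch that assumes 1 MByte = 10^6 bytes and that each processor runs at its nominal clock rate (2.27 GHz for the Xeon, 3.2 GHz for the Cell BE); the function name is illustrative.

```c
/* Convert a throughput in MBytes/sec into clock cycles spent per byte,
 * assuming 1 MByte = 1,000,000 bytes. */
static double cycles_per_byte(double clock_hz, double mbytes_per_sec)
{
    return clock_hz / (mbytes_per_sec * 1.0e6);
}
```

Under these assumptions, `cycles_per_byte(2.27e9, 101.8)` gives about 22.3 cycles/byte for the Xeon, inside the reported 15-27 range, and `cycles_per_byte(3.2e9, 205.6)` gives about 15.6 cycles/byte for the Cell BE single-file SHA256 rate.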

The aggregate is measured on the CellBE but is only projected on the Xeon since we were
unable to get all 8 cores to run simultaneously without interference. One or more are used to
support the OS and process tasks. We think all the measurements only include a limited portion
of the file system access. We did not use the “openssl” versions on the Xeon. A quick test
showed that the “openssl” versions of MD4 and SHA1 are marginally faster than the versions
we tested.
Hashing Speed (Gb/s)

implementation          hash type
who: SPU, file          MD5       SHA-1     SHA-256
IBM:  1, 1              2.448     2.116     0.854
ours: 1, 1              2.544     2.604     1.392
ours: 2, 1              N/A       3.352     1.653
ours: 1, 4              10.116    7.335     3.403

Encryption Speed (Gb/s, 1 SPU)

AES mode                keysize
                        128       192       256
our CTR                 2.071     1.722     1.474
IBM’s ECB (encr)        2.059     1.710     1.462
our ECB (encr)          2.092     1.737     1.484
IBM’s CBC (encr)        0.795     0.664     0.570
our CBC (encr)          1.191     0.989     0.846

Table 3: Our measured speeds for hashing and AES encryption, compared with IBM’s published
results[2]. Our hashing results show speedups from two different types of parallelism: using 2 SPUs
for a single file, or hashing four files on one SPU.


4  AES  on  the  SPU 

The implementation of AES GCM on the SPU consists of an AES implementation, a GCM
implementation, and a data pump state machine that schedules data transfer from main
memory to SDRAM, performs encryption, and then schedules transfer back from SDRAM to main
memory.

While  our  main  interest  is  in  the  Counter  (CTR)  mode  of  AES,  as  part  of  the  Galois  Counter 
Mode,  we  also  implemented  the  Electronic  CodeBook  (ECB)  mode  and  Cipher  Block  Chaining 
(CBC)  feedback  mode  for  comparison;  the  data  throughput  speeds  for  a  single  SPU  (encryption 
only)  are  summarized  in  Table  3.  Note  that  our  speeds  for  ECB  are  slightly  faster  than  IBM’s, 
and  our  speeds  for  CBC  are  significantly  better. 
Figure 7: Our test architecture for failsafe AES encryption.


Figure 8: An alternative architecture for failsafe AES encryption. This architecture provides additional
reliability against incorrect operation but does not place hard limits between the movement of
unencrypted data through the SPU to the output buffer.


5  Parallel  Architectures  with  AES 

In our work with AES Encryption, we developed two approaches for utilizing on-chip
parallelism provided by multiple SPUs for detecting Single Event Upsets (SEU). Our motivation for
detecting SEUs is to ensure that unencrypted data never flows to the output buffer.

Figure 7 shows our first proposal for creating a network for SPUs to improve reliability. Data is
transferred from main memory to both SPU1 and SPU2. Encrypted data is sent from the two
SPUs to SPU3 where it is compared and, if validated, sent to the output buffer. The advantage
of this approach is that no SEU in SPU1, SPU2 or SPU3 can result in cleartext data being copied
to the output buffer. The disadvantage is that a SEU in SPU3 can nevertheless result in incorrect
data being output, and there is an equal chance of a failure in SPU1, SPU2 and SPU3.
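The role of the comparing SPU can be sketched in a few lines of C. This is an illustration of the compare-then-release idea only, not our SPU implementation; the function name `release_if_equal` is hypothetical, and plain memory copies stand in for the DMA transfers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Compare two independently produced ciphertexts of n bytes. Only if they
 * match is the data copied to the output buffer; on a mismatch the output
 * buffer is left untouched and false is returned. */
static bool release_if_equal(const uint8_t *ct_a, const uint8_t *ct_b,
                             size_t n, uint8_t *out)
{
    if (memcmp(ct_a, ct_b, n) != 0) {
        return false;
    }
    memcpy(out, ct_a, n);
    return true;
}
```

Because the comparing stage only ever handles ciphertext, a fault there can corrupt the output but cannot expose cleartext, which matches the trade-off described above.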

An alternative architecture is shown in Figure 8.
6 Conclusion

In this report we have shown how the CellBE architecture can deliver impressive performance
in hashing and encryption through the use of custom-written VLIW assembly code and through
careful attention to buffer management. Although the Cell has a reputation as being hard to
program, we did not find this to be the case. Certainly the Cell had a learning curve, but we
were able to master the complexities of this architecture in a relatively short amount of time. It
is unfortunate that IBM halted development on the Cell architecture, as it clearly represented a
high-performance, cost-effective computing platform.
A Introduction

We  provide  two  optimizations  of  high  speed  SHA-256[5],  MD5[6],  and  SHA-1[5]  hashing  on 
the  Cell  BE: 

1. Maximum throughput, achieved by hashing four separate datasets concurrently in lanes
on one or more SPUs, and

2.  Maximum  single-file  throughput  with  minimal  latency,  where  multiple  SPUs  participate 
in  calculating  various  hashes  using  pipelined  processing. 

Six  hash  outputs  are  available:  SHA-256,  MD5,  and  SHA-1  file  hashes  and  SHA-256,  MD5, 
and  SHA-1  4  KByte  Sector  hashes  of  files. 

Both  programs  run  from  the  shell  and  use  SPU  resources. 

This  document  describes  the  following: 

• This introduction to hashing.

• An overview of hash_files and hash_file hashing programs.

• A description of the NPS Cell PPU Hash Utilities.

• A description of the NPS Cell SPU Hash Utilities.

• Details of the hash_files application.

• Details of the hash_file application.
B Application Overview

B.1 hash_files

Application hash_files provides maximum total throughput using lane processing. Lane
processing hashes four separate datasets in four lanes concurrently on an SPU. Multiple SPUs
may be active hashing files simultaneously. Application hash_files requires the following
hashing utilities:

•  From  NPS  Cell  PPU  Hash  Utilities: 

-  Lane  context  modules. 

-  File  context  modules. 

-  Data  context  modules. 

•  From  NPS  Cell  SPU  Hash  Utilities: 

-  Modules  that  hash  files  and  Sectors  in  lanes,  and,  for  finalization,  modules  that 
perform  a  single  optimized  hash  on  one  SPU. 

B.2 hash_file

Application hash_file provides maximum single-file throughput by pipelining hashing tasks
across multiple SPUs. SHA-256 and SHA-1 hashing is split into parts and is distributed across
multiple pipelining SPUs. MD5 and Sector hashing tasks are split across multiple SPUs.
Application hash_file requires the following hashing utilities:

•  From  NPS  Cell  PPU  Hash  Utilities: 

-  Pipeline  context  modules. 

-  File  context  modules. 

-  Data  context  modules. 

•  From  NPS  Cell  SPU  Hash  Utilities: 

-  Modules  that  hash  SHA-256  and  SHA-1  in  pipelined  fashion,  modules  that  hash 
MD5  as  a  single  thread,  and  modules  that  hash  sectors  in  lanes. 
C NPS Cell PPU Hash Utilities

PPU hash utilities provide interfaces for hashing and interfaces for managing files and data to
be hashed. The NPS Cell PPU Hash Utilities are described in MAN pages.

PPU hash utilities are defined in the following interface files:

• lanes_hash.h defines interfaces and type declarations for hashing with maximum
throughput on SPUs using lane processing.

• pipe_hash.h defines interfaces and type declarations for hashing with maximum
single-file throughput on two SPUs using pipelined processing.

• single_md5_hash.h defines interfaces and type declarations for hashing MD5 with
maximum single-file throughput on one SPU.

• shared_context.h defines general interfaces, type declarations, and constants for
supporting data and file management involved in hashing.

C.1 lanes_hash.h Lane Hash Interfaces

These interfaces provide SPU hashing resources which use lane processing to achieve maximum
total throughput.

C.1.1 Lane Interfaces

• lanes_open_spu() opens an SPU resource for hashing lanes. Specifically: memory is
allocated for the lanes SPU context, the SPE context is created, the lanes_spu program
is loaded, the SPU thread is started, and the SPU’s local store address ls_lanes_shared_data
is obtained from the started SPU so that the PPU can communicate to the SPU.

• lanes_close_spu(lanes_spu_ctx_ptr) closes an SPU resource for hashing lanes.
Specifically: the SPU is signalled to stop, the SPU thread is closed, the SPE context is
closed, and memory for the lanes SPU context is deallocated.

• lanes_hash(lanes_spu_ctx_ptr, data_ctx_ptr1, data_ctx_ptr2, data_ctx_ptr3,
data_ctx_ptr4, hash_flags) hashes four datasets residing in four data contexts
using the requested hash algorithms.

• lanes_poll_spu(lanes_spu_ctx_ptr) identifies when hashing is done and
transitions data state when hashing is done.

C.1.2  PPU  to  SPU  Signal  Vector 

The PPU signals to the SPU that a hashing job is ready using the following signal vector:


typedef struct lanes_ppu_to_spu_t {
    volatile unsigned int signal __attribute__((aligned(16)));
    volatile unsigned int hash_flags;
    unsigned int dummy[2]; // alignment to 16-bytes for block DMA transfer
} lanes_ppu_to_spu_t;

• Field signal signals the SPU that a hash job is available. The PPU sends either
SPU_SIGNAL to indicate that a job is available or SPU_DONE to indicate that there
are no more jobs and that the SPU should exit. The SPU resets its own signal value to
SPU_NO_SIGNAL to indicate that the job has been serviced.

• Field hash_flags provides hash job bit fields for SHA-256, MD5, and SHA-1 file and
sector hashes.

• Field dummy[2] fills out the signal vector to 16 bytes for a full 16-byte vector DMA
transfer.
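
The padding rule above can be checked at compile time. The following is a minimal sketch, assuming a GCC-compatible compiler with C11 support and a 32-bit unsigned int; the helper is_dma_aligned is a hypothetical name introduced here for illustration and is not part of the NPS utilities.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy of the PPU to SPU signal vector described above. */
typedef struct lanes_ppu_to_spu_t {
    volatile unsigned int signal __attribute__((aligned(16)));
    volatile unsigned int hash_flags;
    unsigned int dummy[2]; /* pads the structure to 16 bytes */
} lanes_ppu_to_spu_t;

/* Compile-time checks: one vector occupies exactly one 16-byte unit. */
_Static_assert(sizeof(unsigned int) == 4, "sketch assumes 32-bit unsigned int");
_Static_assert(sizeof(lanes_ppu_to_spu_t) == 16, "signal vector must be 16 bytes");
_Static_assert(_Alignof(lanes_ppu_to_spu_t) == 16, "signal vector must be 16-byte aligned");

/* Hypothetical helper: returns 1 if an address is 16-byte aligned. */
int is_dma_aligned(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

If a field is added without shrinking dummy, the size assertion fails at compile time instead of producing a misaligned transfer at run time.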

C.1.3 SPU to PPU Signal Vector

When the SPU has completed hash processing, it signals the PPU that the hashing job is
completed using the SPU to PPU signal vector.


typedef struct lanes_spu_to_ppu_t {
    volatile unsigned int signal __attribute__((aligned(16)));
    volatile unsigned int tick_start; // timer
    volatile unsigned int tick_finish; // timer
    unsigned int dummy[1]; // alignment to 16-bytes for block DMA transfer
} lanes_spu_to_ppu_t;

• Field signal signals the PPU that a hash job has completed. The SPU sends SPU_AVAILABLE
to indicate that it is ready for a hash job. The PPU presets its own signal value to
SPU_NOT_AVAILABLE to await the upcoming completion signal from the SPU.

• If the SPU code is compiled with the __TIMING__ cflag switch set, then the tick_start
and tick_finish fields will contain SPU tick start and stop timing information.

• Field dummy[1] fills out the signal vector to 16 bytes for a full 16-byte vector DMA
transfer.
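
The two signal vectors together form a simple handshake. The sketch below models it in ordinary C under several assumptions: the numeric values of the signal constants are not given in the text and are chosen arbitrarily here, the type mailbox_t and all function names are hypothetical, and the two signal words live in one structure, whereas on the Cell each side keeps its own copy and updates the other by DMA.

```c
#include <assert.h>

/* Names come from the text; the numeric values are assumed for illustration. */
enum { SPU_NO_SIGNAL = 0, SPU_SIGNAL = 1, SPU_DONE = 2 };
enum { SPU_NOT_AVAILABLE = 0, SPU_AVAILABLE = 1 };

/* Hypothetical stand-in for the pair of signal words. */
typedef struct {
    volatile unsigned int to_spu; /* written by the PPU, reset by the SPU */
    volatile unsigned int to_ppu; /* preset by the PPU, written by the SPU */
} mailbox_t;

/* PPU side: preset the completion word, then post a job. */
void ppu_post_job(mailbox_t *m)
{
    m->to_ppu = SPU_NOT_AVAILABLE;
    m->to_spu = SPU_SIGNAL;
}

/* PPU side: tell the SPU that there are no more jobs. */
void ppu_post_done(mailbox_t *m)
{
    m->to_spu = SPU_DONE;
}

/* SPU side: service one pending signal. Returns 0 when told to exit. */
int spu_service(mailbox_t *m)
{
    unsigned int s = m->to_spu;
    if (s == SPU_DONE)
        return 0;
    if (s == SPU_SIGNAL) {
        /* hashing would take place here */
        m->to_spu = SPU_NO_SIGNAL; /* the job has been serviced */
        m->to_ppu = SPU_AVAILABLE; /* completion signal to the PPU */
    }
    return 1;
}

/* PPU side: returns 1 once the SPU has reported completion. */
int ppu_poll(const mailbox_t *m)
{
    return m->to_ppu == SPU_AVAILABLE;
}
```

Presetting the completion word before posting the job is what lets the PPU distinguish a new completion from a stale one left by the previous job.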

C.1.4  PPU  Shared  Data  Structure 

The PPU reserves the following data space for communicating with the SPU. The SPU receives
a pointer to this structure when it is initialized so that it can read from and write to it.

typedef struct ppu_lanes_shared_data_t {
    lanes_ppu_to_spu_t signal_to_spu;
    lanes_spu_to_ppu_t signal_to_ppu;
    data_ctx_ptr_t data_ctx_ptr[4] __attribute__((aligned(16)));
} ppu_lanes_shared_data_t;

• Field signal_to_spu contains the vector for signaling to the SPU.

• Field signal_to_ppu contains the vector for receiving signaling from the SPU.

• Field data_ctx_ptr[4] contains the four pointers to the four data contexts containing
data to be hashed. The SPU reads these four pointers.

C.1.5 SPU Shared Data Structure

The SPU Shared Data Structure is the SPU’s data space for communicating with the PPU. When
the SPU initializes, it sends a pointer to this structure to the PPU so that the PPU knows where
to signal to the SPU.


typedef  struct  spu_lanes_shared_data_t  { 
lanes_ppu_to_spu_t  signal_to_spu; 
lanes_spu_to_ppu_t  signal_to_ppu; 

}  spu_lanes_shared_data_t ; 

•  Field  signal_to_spu  contains  the  Vector  for  receiving  signaling  from  the  PPU. 

•  Field  signal_to_ppu  contains  the  Vector  for  signaling  to  the  PPU. 

C.1.6  Lanes  SPU  Context  Structure 

The  Lanes  SPU  Context  tracks  the  SPU  and  the  process  thread  that  is  associated  with  it,  and 
holds  information  for  communicating  between  the  PPU  and  SPU. 

typedef  struct  lanes_spu_ctx_t  { 

spe_context_ptr_t spe_ctx; // pointer to SPU's SPE context
pthread_t  spu_thread;  //  pthread 

unsigned  int  spu_index;  //  used  for  SPU  tracking  in  printf 
ppu_lanes_shared_data_t  ppu_shared_data;  //  shareable  data  on  PPU 
spu_lanes_shared_data_t  *ls_spu_shared_data;  //  LS  of  shareable  data  oi 
}  lanes_spu_ctx_t ; 

•  Field  spe_ctx  points  to  the  SPE  context  which  identifies  the  associated  SPU. 

•  Field  spu_thread  identifies  the  PPU  thread  under  which  the  associated  SPU  runs. 

• Field spu_index indexes the SPU for diagnostic purposes and is otherwise unused.

•  Field  ppu_shared_data  contains  the  shared  data  structure  on  the  PPU  side  that  is 
associated  with  the  lanes  SPU.  Data  within  this  structure  is  shared  with  the  SPU. 

• Field ls_spu_shared_data contains the local store address of the shared data
structure on the SPU side that is associated with the lanes SPU. The PPU uses this pointer to
communicate with the SPU.

C.2 pipe_hash.h Pipelined Hash Interfaces

These  interfaces  provide  SPU  hashing  resources  which  use  pipelined  processing  of  SHA-256 
or  SHA-1  to  achieve  maximum  single-file  throughput. 

C.2.1  Pipeline  Interfaces 

• pipe_open_spu() opens an SPU resource for hashing SHA-256 or SHA-1 using pipelined
processing across two SPUs. Specifically: memory is allocated for the pipeline SPU context,
SPE contexts are created, SPU programs pipe1_spu and pipe2_spu are loaded,
the two SPU threads are started, the SPUs’ local store addresses are obtained from the
started SPUs so that the PPU can communicate with the SPUs, and the SPUs’ local store
addresses are exchanged between the SPUs so that the SPUs can communicate with each
other.

• pipe_close_spu(pipe_spu_ctx_ptr) closes an SPU resource used for pipelined
hashing. Specifically: the SPUs are signalled to stop, the SPU threads are closed, the SPE
contexts are closed, and memory for the pipeline SPU context is deallocated.

• pipe_hash(pipe_spu_ctx_ptr, data_ctx_ptr, hash_flags) hashes a dataset
using the requested SHA-256 or SHA-1 algorithm.

• pipe_poll_spu(pipe_spu_ctx_ptr) identifies when hashing is done and
transitions the data state when hashing is done.

C.2.2  PPU  to  Pipe  Signal  Vectors 

The PPU uses the following signal vector to signal both pipeline SPUs that a hashing job is
ready:


typedef struct ppu_to_pipe_t {
    volatile unsigned int signal __attribute__((aligned(16)));
    volatile unsigned int hash_flags;
    volatile data_ctx_ptr_t data_ctx_ptr;
    unsigned int dummy[1]; // alignment to 16-bytes for block DMA transfer
} ppu_to_pipe_t;

• Field signal signals the SPU that a hash job is available. The PPU sends either
SPU_SIGNAL to indicate that a job is available or SPU_DONE to indicate that there
are no more jobs and that the SPU should exit. The SPU resets its own signal value to
SPU_NO_SIGNAL to indicate that the job has been serviced.

• Field hash_flags provides a hash job code for SHA-256 or SHA-1, and is FILE_SHA256
or FILE_SHA1.

• Field data_ctx_ptr points to the data context for the data being hashed.

• Field dummy[1] fills out the signal vector to 16 bytes to allow a full 16-byte vector
DMA transfer.

C.2.3 SPU to PPU Signal Vector

When the SPU has completed hash processing, it signals the PPU that the hashing job is
completed using the SPU to PPU signal vector. Both SPUs provide this signal when they have
completed their respective hashing role, specifically, when pipe1 finishes providing message
schedule data to pipe2, and when pipe2 finishes hashing message schedule data.

typedef struct pipe_to_ppu_t {
    volatile unsigned int signal __attribute__((aligned(16)));
    volatile unsigned int tick_start; // timer
    volatile unsigned int tick_finish; // timer
    unsigned int dummy[1]; // alignment to 16-bytes for block DMA transfer
} pipe_to_ppu_t;

• Field signal signals the PPU that a hash job has completed. The SPU sends SPU_AVAILABLE
to indicate that it is ready for a hash job. The PPU presets its own signal value to
SPU_NOT_AVAILABLE to await the upcoming completion signal from the SPU.

• If the SPU code is compiled with the __TIMING__ cflag switch set, then the tick_start
and tick_finish fields will contain SPU tick start and stop timing information.

• Field dummy[1] fills out the signal vector to 16 bytes for a full 16-byte vector DMA
transfer.


C.2.4 SPU to SPU Signal Vector

The pipe1 SPU signals the pipe2 SPU when the pipe1 SPU has completed preparing schedule
data that is available for the pipe2 SPU to consume. Two signal channels are used because two
buffers of data are prepared. By using double-buffering, one buffer may be consumed while the
other is being prepared.

The pipe2 SPU signals the pipe1 SPU when it has completed pulling schedule data from the
pipe1 SPU. When the pipe1 SPU receives this signal, it is free to replace the buffer with the
next bufferful of schedule data. Two signal channels are used because schedule data is
double-buffered.

The SPU to SPU signal data structure follows.


typedef struct pipe_spu_to_spu_t {
    volatile unsigned int signal __attribute__((aligned(16)));
    unsigned int dummy[3]; // alignment to 16-bytes for block DMA transfer
} pipe_spu_to_spu_t;

• Field signal signals the other SPU that the resource is available (on pipe1) or has been
consumed (by pipe2).

• Field dummy[3] fills out the signal vector to 16 bytes for a full 16-byte vector DMA
transfer.

The  SPU  to  SPU  signal  values  follow: 


//  signal  between  SPUs 
#define  CLEARED  0 
#define  SET  1 

• SET indicates an active signal.

• CLEARED is written on the local SPU to clear the SET signal sent by the other SPU.
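
The double-buffered signaling between the two pipeline SPUs can be modeled in plain C. This is a single-threaded sketch under stated assumptions: the type pipe_model_t and the function names are hypothetical, an int stands in for a buffer of schedule data, and direct assignments stand in for the DMA transfers that carry signals between local stores.

```c
#include <assert.h>

#define CLEARED 0
#define SET 1

/* Hypothetical model: per buffer, one signal in each direction. */
typedef struct {
    unsigned int ready[2];    /* pipe1 to pipe2: buffer i holds schedule data */
    unsigned int consumed[2]; /* pipe2 to pipe1: buffer i has been pulled */
    int buffer[2];            /* stand-in for the schedule data buffers */
} pipe_model_t;

void pipe_model_init(pipe_model_t *p)
{
    for (int i = 0; i < 2; i++) {
        p->ready[i] = CLEARED;
        p->consumed[i] = SET; /* both buffers start out free */
        p->buffer[i] = 0;
    }
}

/* pipe1: fill buffer i if pipe2 has released it. Returns 1 on success. */
int pipe1_produce(pipe_model_t *p, int i, int value)
{
    if (p->consumed[i] != SET)
        return 0;              /* pipe2 has not pulled this buffer yet */
    p->consumed[i] = CLEARED;  /* clear the received signal locally */
    p->buffer[i] = value;
    p->ready[i] = SET;         /* signal pipe2 that data is available */
    return 1;
}

/* pipe2: pull buffer i if pipe1 has marked it ready. Returns 1 on success. */
int pipe2_consume(pipe_model_t *p, int i, int *out)
{
    if (p->ready[i] != SET)
        return 0;              /* nothing to consume yet */
    p->ready[i] = CLEARED;     /* clear the received signal locally */
    *out = p->buffer[i];
    p->consumed[i] = SET;      /* signal pipe1 that the buffer is free */
    return 1;
}
```

With two buffers, pipe1 can fill buffer 1 while pipe2 is still working on buffer 0, which is the overlap that double-buffering is meant to provide.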

C.2.5 PPU Shared Data Structure

The PPU reserves the following data space for communicating with the pipe1 and pipe2 SPUs.
Each SPU receives a pointer to the two SPU shared structures when it is initialized so that it can
read from and write to them.

typedef struct ppu_pipe_shared_data_t {
    ppu_to_pipe_t ppu_to_pipe;
    pipe_to_ppu_t pipe1_to_ppu;
    pipe_to_ppu_t pipe2_to_ppu;
} ppu_pipe_shared_data_t;

• Field ppu_to_pipe contains the vector for signaling to the SPUs.

• Field pipe1_to_ppu contains the vector for receiving signaling from the pipe1 SPU.

• Field pipe2_to_ppu contains the vector for receiving signaling from the pipe2 SPU.

C.2.6 Pipe1 Shared Data Structure

The pipe1 Shared Data Structure is pipeline 1’s SPU data space for communicating with the
PPU and with the other SPU. When the SPU initializes, it shares a pointer to this structure with
the PPU and the other SPU so that they know where to signal to it.

typedef struct spu_pipe1_shared_data_t {
    SPU_PIPE2_SHARED_DATA_T *ea_pipe2_shared_data;
    ppu_to_pipe_t ppu_to_pipe1;
    pipe_to_ppu_t pipe1_to_ppu;
    pipe_spu_to_spu_t pipe1_to_pipe2;
    pipe_spu_to_spu_t pipe2_to_pipe1[2];
    unsigned char data[2][BUFFER_SIZE] __attribute__((aligned(16)));
    unsigned char schedule_data[2][SCHEDULE_BUFFER_SIZE] __attribute__((a^
} spu_pipe1_shared_data_t;

• Field ea_pipe2_shared_data contains the Effective Address (EA) pointer to the
pipe2 shared data structure so that the pipe1 SPU can DMA Put signaling to pipe2.

• Field ppu_to_pipe1 contains the signaling vector the PPU uses to signal to pipe1.

• Field pipe1_to_pipe2 contains the signal vector pipe1 uses to communicate to pipe2,
and is constant.

• Field pipe2_to_pipe1[2] contains the two signal vectors pipe2 uses to communicate
to pipe1.

• Field data[2][BUFFER_SIZE] contains data to be hashed, and is double-buffered.
Pipe1 uses DMA Get to prefetch this data. Pipe1 does not preprocess this data into
schedule data until cleared by a signal from pipe2 to do so.

• Field schedule_data[2][SCHEDULE_BUFFER_SIZE] contains double-buffered
schedule data. Pipe2 uses DMA Get to read this schedule data after pipe1 signals that this
schedule data is available.
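
For SHA-256, the schedule data that pipe1 prepares can be illustrated with the standard message schedule expansion from FIPS 180-4. The following is a portable C sketch, not the authors' SPU assembly; the function names are hypothetical, and it is an assumption that the precomputation table consists of the expanded words alone (the actual table may, for example, already include the round constants). Each 64-byte block expands to 64 32-bit words, a fourfold growth that is consistent with the 4 KByte input and 16 KByte precomputation table sizes quoted in Appendix D.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t rotr32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32 - n)); /* n is always in 1..31 here */
}

static uint32_t small_sigma0(uint32_t x)
{
    return rotr32(x, 7) ^ rotr32(x, 18) ^ (x >> 3);
}

static uint32_t small_sigma1(uint32_t x)
{
    return rotr32(x, 17) ^ rotr32(x, 19) ^ (x >> 10);
}

/* Expand one 64-byte block (big-endian words) into the schedule W[0..63]. */
void sha256_schedule(const unsigned char block[64], uint32_t w[64])
{
    for (int t = 0; t < 16; t++) {
        w[t] = ((uint32_t)block[4 * t] << 24) |
               ((uint32_t)block[4 * t + 1] << 16) |
               ((uint32_t)block[4 * t + 2] << 8) |
               (uint32_t)block[4 * t + 3];
    }
    for (int t = 16; t < 64; t++) {
        w[t] = small_sigma1(w[t - 2]) + w[t - 7] +
               small_sigma0(w[t - 15]) + w[t - 16];
    }
}

/* Schedule bytes produced for a given number of input bytes (whole blocks). */
size_t sha256_schedule_bytes(size_t input_bytes)
{
    return (input_bytes / 64) * 64 * sizeof(uint32_t);
}
```

Because the expansion depends only on the message block and not on the running hash value, it can run ahead on one SPU while the other SPU performs the compression rounds.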

C.2.7 Pipe2 Shared Data Structure

The pipe2 Shared Data Structure is pipeline 2’s SPU data space for communicating with the
PPU and with the other SPU. When the SPU initializes, it shares a pointer to this structure with
the PPU and the other SPU so that they know where to signal to it.

typedef struct spu_pipe2_shared_data_t {
    SPU_PIPE1_SHARED_DATA_T *ea_pipe1_shared_data;
    ppu_to_pipe_t ppu_to_pipe2;
    pipe_to_ppu_t pipe2_to_ppu;
    pipe_spu_to_spu_t pipe2_to_pipe1;
    pipe_spu_to_spu_t pipe1_to_pipe2[2];
    sha256_hash_t sha256_hash __attribute__((aligned(16)));
    sha1_hash_t sha1_hash __attribute__((aligned(16)));
    unsigned char schedule_data[2][SCHEDULE_BUFFER_SIZE] __attribute__((a^
} spu_pipe2_shared_data_t;

• Field ea_pipe1_shared_data contains the Effective Address (EA) pointer to the
pipe1 shared data structure so that the pipe2 SPU can DMA Put signaling to pipe1.

• Field ppu_to_pipe2 contains the signaling vector the PPU uses to signal to pipe2.

• Field pipe2_to_pipe1 contains the signal vector pipe2 uses to communicate to pipe1,
and is constant.

• Field pipe1_to_pipe2[2] contains the two signal vectors pipe1 uses to communicate
to pipe2.

• Field schedule_data[2][SCHEDULE_BUFFER_SIZE] contains double-buffered
schedule data. Pipe2 uses DMA Get to read this schedule data from pipe1 after pipe1 signals
that this schedule data is available.

C.2.8  Pipe  SPU  Context  Structure 

The  Pipe  SPU  Context  tracks  the  SPUs  and  process  threads  that  are  associated  with  them,  and 
holds  information  for  communicating  between  the  PPU  and  SPU. 

typedef struct pipe_spu_ctx_t {
    spe_context_ptr_t spe_pipe1_ctx; // pointer to pipe1 SPU's SPE context
    spe_context_ptr_t spe_pipe2_ctx; // pointer to pipe2 SPU's SPE context
    pthread_t spu_pipe1_thread; // pthread
    pthread_t spu_pipe2_thread; // pthread
    ppu_pipe_shared_data_t ppu_shared_data; // shareable data on PPU
    spu_pipe1_shared_data_t *ls_pipe1_shared_data; // LS of shareable data
    spu_pipe2_shared_data_t *ls_pipe2_shared_data; // LS of shareable data
} pipe_spu_ctx_t;

• Field spe_pipe1_ctx points to the pipe1 SPE context.

• Field spe_pipe2_ctx points to the pipe2 SPE context.

• Field spu_pipe1_thread identifies the PPU thread under which the pipe1 SPU runs.

• Field spu_pipe2_thread identifies the PPU thread under which the pipe2 SPU runs.

• Field ppu_shared_data contains the shared data structure on the PPU side that is
associated with the pipeline SPUs.

• Field ls_pipe1_shared_data contains the local store address of the pipe1 SPU shared
data structure.

• Field ls_pipe2_shared_data contains the local store address of the pipe2 SPU shared
data structure. The PPU uses this pointer to communicate with the SPU.

C.3 single_md5_hash.h Single MD5 Hash Interfaces

These interfaces provide SPU hashing resources which hash MD5 on a single SPU to achieve
maximum single-file throughput.

C.3.1  Single  MD5  Hash  Interfaces 

• md5_open_spu() opens an SPU resource for hashing MD5. Specifically: memory is
allocated for the SPU context, the SPE context is created, the md5_spu program is loaded,
the SPU thread is started, and the SPU’s local store address ls_spu_shared_data is
obtained from the started SPU so that the PPU can communicate with the SPU.

• md5_close_spu(md5_spu_ctx_ptr) closes an SPU resource for MD5 hashing.
Specifically: the SPU is signalled to stop, the SPU thread is closed, the SPE context is closed,
and memory for the SPU context is deallocated.

• md5_hash(md5_spu_ctx_ptr, data_ctx_ptr) performs MD5 hashing.

• md5_poll_spu(md5_spu_ctx_ptr) identifies when hashing is done and transitions
the data state when hashing is done.

C.3.2  PPU  to  SPU  Signal  Vector 

The PPU signals to the SPU that a hashing job is ready using the following signal vector:


typedef struct md5_ppu_to_spu_t {
    volatile unsigned int signal __attribute__((aligned(16)));
    volatile data_ctx_ptr_t data_ctx_ptr;
    unsigned int dummy[2]; // alignment to 16-bytes for block DMA transfer
} md5_ppu_to_spu_t;

• Field signal signals the SPU that a hash job is available. The PPU sends either
SPU_SIGNAL to indicate that a job is available or SPU_DONE to indicate that there
are no more jobs and that the SPU should exit. The SPU resets its own signal value to
SPU_NO_SIGNAL to indicate that the job has been serviced.

• Field data_ctx_ptr points to the data context for the data being hashed.

• Field dummy[2] fills out the signal vector to 16 bytes for a full 16-byte vector DMA
transfer.

C.3.3 SPU to PPU Signal Vector

When the SPU has completed hash processing, it signals the PPU that the hashing job is
completed using the SPU to PPU signal vector.


typedef struct md5_spu_to_ppu_t {
    volatile unsigned int signal __attribute__((aligned(16)));
    volatile unsigned int tick_start; // timer
    volatile unsigned int tick_finish; // timer
    unsigned int dummy[1]; // alignment to 16-bytes for block DMA transfer
} md5_spu_to_ppu_t;

• Field signal signals the PPU that a hash job has completed. The SPU sends SPU_AVAILABLE
to indicate that it is ready for a hash job. The PPU presets its own signal value to
SPU_NOT_AVAILABLE to await the upcoming completion signal from the SPU.

• If the SPU code is compiled with the __TIMING__ cflag switch set, then the tick_start
and tick_finish fields will contain SPU tick start and stop timing information.

• Field dummy[1] fills out the signal vector to 16 bytes for a full 16-byte vector DMA
transfer.

C.3.4  PPU  Shared  Data  Structure 

The PPU reserves the following data space for communicating with the SPU. The SPU receives
a pointer to this structure when it is initialized so that it can read from and write to it.

typedef struct ppu_md5_shared_data_t {
    md5_ppu_to_spu_t signal_to_spu;
    md5_spu_to_ppu_t signal_to_ppu;
} ppu_md5_shared_data_t;

• Field signal_to_spu contains the vector for signaling to the SPU.

• Field signal_to_ppu contains the vector for receiving signaling from the SPU.

C.3.5  SPU  Shared  Data  Structure 

The  SPU  Shared  Data  Structure  is  the  SPU’s  data  space  for  communicating  with  the  PPU.  When 
the  SPU  initializes,  it  sends  a  pointer  to  this  structure  to  the  PPU  so  that  the  PPU  knows  where 
to  signal  to  the  SPU. 


typedef struct spu_md5_shared_data_t {
    md5_ppu_to_spu_t signal_to_spu;
    md5_spu_to_ppu_t signal_to_ppu;
    md5_hash_t md5_hash __attribute__((aligned(16)));
    unsigned char data[2][BUFFER_SIZE] __attribute__((aligned(16)));
} spu_md5_shared_data_t;

• Field signal_to_spu contains the vector for receiving signaling from the PPU.

• Field signal_to_ppu contains the vector for signaling to the PPU.

• Field md5_hash holds the running MD5 hash value. It is loaded from the data context
space via DMA Get before hashing starts and is used while computing intermediate hash
values. It is written back to the data context space via DMA Put after the hash is updated.

• Field data[2][BUFFER_SIZE] contains data to be hashed, and is double-buffered.


C.3.6  MD5  SPU  Context  Structure 

The  MD5  SPU  Context  tracks  the  SPU  and  the  process  thread  that  is  associated  with  it,  and 
holds  information  for  communicating  between  the  PPU  and  SPU. 


typedef struct md5_spu_ctx_t {
    spe_context_ptr_t spe_ctx; // pointer to SPU's SPE context
    pthread_t spu_thread; // pthread
    ppu_md5_shared_data_t ppu_shared_data; // shareable data on PPU
    spu_md5_shared_data_t *ls_spu_shared_data; // LS of shareable data on
} md5_spu_ctx_t;

• Field spe_ctx points to the SPE context which identifies the associated SPU.

• Field spu_thread identifies the PPU thread under which the associated SPU runs.

•  Field  ppu_shared_data  contains  the  shared  data  structure  on  the  PPU  side  that  is 
associated  with  the  SPU.  Data  within  this  structure  is  shared  with  the  SPU. 

• Field ls_spu_shared_data contains the local store address of the shared data
structure on the SPU side that is associated with the SPU. The PPU uses this pointer to
communicate with the SPU.

C.4 shared_context.h Shared Interfaces

These  interfaces  manage  the  opening,  reading,  and  closing  of  files  being  hashed.  Actual  data 
read  is  maintained  by  data  interfaces. 

C.4.1  File  Interfaces 

• file_open(filename) creates a file context and opens a file for reading. Initial
count values and starting file hash values are set.

• file_read_sectors(file_ctx_ptr, data_ctx_ptr) reads file data from the
file identified by the file context into space in the data context. The number of bytes read
and the running number of bytes read are set. The data context must be available for this
operation to work. Specifically, the data context state must be DATA_EMPTY.

• file_close(file_ctx_ptr) closes a file that has been opened for reading and
deallocates memory assigned to the file context. An error is returned if a data context is still
associated with the file.




C.4.2  Data  Interfaces 

These  interfaces  provide  data  management  support.  Data  is  read  into  data  resources  using  file 
interfaces  and  is  hashed  using  SPU  hashing  interfaces. 

• data_alloc(sectors_per_read) allocates a data context for reading files and
hashing data, sets read values to zero, and sets the data context state to DATA_EMPTY.

• data_free(data_ctx_ptr) deallocates a data context and associated memory space.
An error is returned if a data context is still active.

• data_clear_hashed_state(data_ctx_ptr) disassociates any file context from
the data context, sets read values to zero, and sets the data context state to DATA_EMPTY.
An error is returned if the data context is still active, which is the case when the data
context state is DATA_READ or DATA_HASHING.


C.4.3  Hash  sizes  and  Types 

These  declarations  define  the  fundamental  SHA-256,  MD5,  and  SHA-1  hash  types. 


// hash constants
#define SHA256_HASH_SIZE 32 // 256 bits
#define MD5_HASH_SIZE 16 // 128 bits
#define SHA1_HASH_SIZE 32 // 160 bits

// types for hashes
typedef unsigned char sha256_hash_t[SHA256_HASH_SIZE];
typedef unsigned char md5_hash_t[MD5_HASH_SIZE];
typedef unsigned char sha1_hash_t[SHA1_HASH_SIZE];
typedef sha256_hash_t *sha256_hash_ptr_t;
typedef md5_hash_t *md5_hash_ptr_t;
typedef sha1_hash_t *sha1_hash_ptr_t;


C.4.4  Hash  Flags 

These  declarations  identify  the  six  hash  algorithms  that  may  be  selected. 


// hash flags
#define FILE_SHA256 0x01
#define FILE_MD5 0x02
#define FILE_SHA1 0x04
#define SECTORS_SHA256 0x08
#define SECTORS_MD5 0x10
#define SECTORS_SHA1 0x20
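
Because each flag occupies its own bit, several algorithms can be requested in one job by combining flags with bitwise OR. The sketch below repeats the flag values listed above; the three helper functions are hypothetical names introduced for illustration and are not part of the NPS interfaces.

```c
#include <assert.h>

#define FILE_SHA256    0x01
#define FILE_MD5       0x02
#define FILE_SHA1      0x04
#define SECTORS_SHA256 0x08
#define SECTORS_MD5    0x10
#define SECTORS_SHA1   0x20

/* Hypothetical helper: returns 1 if the given flag is requested. */
int hash_flag_is_set(unsigned int flags, unsigned int flag)
{
    return (flags & flag) != 0;
}

/* Hypothetical helper: counts how many of the six algorithms are requested. */
int hash_flags_count(unsigned int flags)
{
    int count = 0;
    for (unsigned int bit = FILE_SHA256; bit <= SECTORS_SHA1; bit <<= 1) {
        if (flags & bit)
            count++;
    }
    return count;
}

/* Hypothetical helper: the pipelined interface accepts exactly one of
 * FILE_SHA256 or FILE_SHA1, as described for ppu_to_pipe_t. */
int is_valid_pipe_job(unsigned int flags)
{
    return flags == FILE_SHA256 || flags == FILE_SHA1;
}
```

For example, FILE_SHA256 | SECTORS_MD5 requests a SHA-256 file hash together with MD5 sector hashes in the same lanes job.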

C.4.5 Data Context States

These declarations define the four states that a data context may be in. File, data, and hash
interfaces will fail if their associated data contexts are in incorrect states. Also, calculated hash
values are only valid when data contexts are in their correct states.

//  data  context  state 
#define  DATA_EMPTY  0 
#define  DATA_READ  1 
#define  DATA_HASHING  2 
#define  DATA_HASHED  3 

• DATA_EMPTY indicates that the data context has no data, has no file attached to it, and
has no calculated hash values. It is ready for a file read operation via file_read_sectors.
A hashing operation may be performed on the empty data, but no action will take place.

• DATA_READ indicates that data has been read and a file has been attached to it, but the
data has not been hashed.

• DATA_HASHING indicates that the data is being hashed via a hashing operation such as
lanes_hash. A poll operation returning completion status is required for transitioning
the data context state from DATA_HASHING to DATA_HASHED.

• DATA_HASHED indicates that data has been hashed and hash values are available. Sector
hashes are available. If the data context is at EOF, file hashes are also available. Issue
data_clear_hashed_state to release the file attached to it and to transition the data
context state from DATA_HASHED back to DATA_EMPTY. The file cannot be closed via
file_close until it is released.
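
The state descriptions above amount to a small state machine, sketched here in C. The event type data_event_t and both functions are hypothetical names introduced for illustration. Two points are interpretations of the text rather than confirmed behavior: a hash request on an empty context is modeled as leaving the state unchanged, and a clear request is accepted in DATA_EMPTY as well as DATA_HASHED because only DATA_READ and DATA_HASHING are described as active.

```c
#include <assert.h>

#define DATA_EMPTY   0
#define DATA_READ    1
#define DATA_HASHING 2
#define DATA_HASHED  3

/* Hypothetical events corresponding to the interface calls in the text. */
typedef enum {
    EV_READ,       /* file_read_sectors */
    EV_HASH_START, /* lanes_hash, pipe_hash, or md5_hash */
    EV_POLL_DONE,  /* a poll call that reports completion */
    EV_CLEAR       /* data_clear_hashed_state */
} data_event_t;

/* Returns the next state, or -1 if the event is an error in this state. */
int data_ctx_next_state(int state, data_event_t ev)
{
    switch (ev) {
    case EV_READ:
        return state == DATA_EMPTY ? DATA_READ : -1;
    case EV_HASH_START:
        if (state == DATA_READ)
            return DATA_HASHING;
        return state == DATA_EMPTY ? DATA_EMPTY : -1; /* no action on empty */
    case EV_POLL_DONE:
        return state == DATA_HASHING ? DATA_HASHED : -1;
    case EV_CLEAR:
        return (state == DATA_HASHED || state == DATA_EMPTY) ? DATA_EMPTY : -1;
    }
    return -1;
}

/* File hashes are final only once the context is hashed and at EOF. */
int file_hash_is_final(int state, int eof)
{
    return state == DATA_HASHED && eof != 0;
}
```

A complete cycle for one read is EMPTY, READ, HASHING, HASHED, and back to EMPTY, after which the same data context can take the next read.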

C.4.6  Data  Context  Type  Declarations 

A data context holds data read from a file, tracks the file the data came from, and provides space
for holding sector hashes.

When a data context is created, memory space is allocated for it that is the size of data_ctx_t
plus the size of the data and sector hash space required based on the number of sectors that it
can hold.

// data types
typedef struct data_ctx_t
{
    unsigned int sectors_per_read; // private
    unsigned long long num_previous_bytes; // private
    unsigned int num_current_bytes; // private
    unsigned int eof; // public
    unsigned int data_ctx_state; // public
    file_ctx_ptr_t file_ctx_ptr; // private
    unsigned char *data; // private malloc
    sha256_hash_t *sha256_sector_hash_ptr; // public malloc
    md5_hash_t *md5_sector_hash_ptr; // public malloc
    sha1_hash_t *sha1_sector_hash_ptr; // public malloc
    unsigned char heap[0] __attribute__((aligned(16))); // start of heap S]
} data_ctx_t;

typedef data_ctx_t *data_ctx_ptr_t;

• Field sectors_per_read defines the maximum number of sectors that may be read
during a file_read_sectors file read operation. This value is established during the
data_alloc call. The amount of data and sector hash space reserved is based on this
value.

• Field num_previous_bytes tracks how much data has been read before the current
read, and is used for tracking sector numbers and determining the EOF condition.

• Field num_current_bytes identifies how many bytes are read in the current read, and
is used to determine the number of sectors read, the number of blocks read, and the EOF
condition.

• Field eof indicates that there are no more bytes to process. Note that this is different
from field eof in file_ctx_t, which may reach EOF while more than one data context
may be available containing data from the same file.

• Field data_ctx_state contains the state of the data context, which is one of
DATA_EMPTY, DATA_READ, DATA_HASHING, or DATA_HASHED.

• Field file_ctx_ptr points to the file context from which the last file read was made,
or is NULL if the file context has been released using data_clear_hashed_state.

• Field data points to the start of the data space to be read using file_read_sectors.
Field data starts at the address of field heap and is 16-byte aligned for DMA transfers.
Its size is determined by the number of sectors per read, specifically sectors_per_read
* 4096.

• Fields sha256_sector_hash_ptr, md5_sector_hash_ptr, and sha1_sector_hash_ptr
point to their respective arrays of SHA-256, MD5, and SHA-1 sector hashes. They
immediately follow the allocation for field data. Their size is determined by the number
of sectors per read, field sectors_per_read.
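
The space that follows the data_ctx_t header can be worked out from sectors_per_read. The following is a rough sketch of that arithmetic under stated assumptions: the hash sizes are the values listed in C.4.3 (where SHA1_HASH_SIZE is 32, larger than the 20-byte SHA-1 digest), SECTOR_SIZE is a hypothetical name for the 4096-byte factor, all three function names are hypothetical, and counting a trailing partial sector as one sector is an interpretation that the text does not confirm.

```c
#include <assert.h>
#include <stddef.h>

#define SECTOR_SIZE      4096 /* hypothetical name for the 4096 factor */
#define SHA256_HASH_SIZE 32
#define MD5_HASH_SIZE    16
#define SHA1_HASH_SIZE   32   /* value as listed in C.4.3 */

/* Bytes needed after the data_ctx_t header: the data area followed by
 * the three sector hash arrays. */
size_t data_ctx_heap_bytes(unsigned int sectors_per_read)
{
    size_t data_bytes = (size_t)sectors_per_read * SECTOR_SIZE;
    size_t hash_bytes = (size_t)sectors_per_read *
        (SHA256_HASH_SIZE + MD5_HASH_SIZE + SHA1_HASH_SIZE);
    return data_bytes + hash_bytes;
}

/* Sectors covered by the current read; a partial last sector counts as one. */
unsigned int sectors_in_read(unsigned int num_current_bytes)
{
    return (num_current_bytes + SECTOR_SIZE - 1) / SECTOR_SIZE;
}

/* Number of the first sector in the current read, from the bytes read so far. */
unsigned long long first_sector_number(unsigned long long num_previous_bytes)
{
    return num_previous_bytes / SECTOR_SIZE;
}
```

Since the data area is a multiple of 4096 bytes and each hash size is a multiple of 16, every array that follows a 16-byte aligned heap start is itself 16-byte aligned, which suits the DMA transfers described above.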

C.4.7  File  Context  Type  Declarations 

A file context contains information regarding an opened file including a pointer to its name, its
file size, how many bytes have been read so far, and the file hash values being calculated for it.

When a file context is created, memory space is allocated for it that is the size of file_ctx_t.

// file types
typedef struct file_hash_t
{
    sha256_hash_t sha256_hash __attribute__((aligned(16))); // private
    md5_hash_t md5_hash; // private
    sha1_hash_t sha1_hash; // private
} file_hash_t;

typedef struct file_ctx_t
{
    char *filename; // public
    int file_descriptor; // private
    off_t file_size; // private
    off_t total_bytes_read; // private
    int eof; // public
    unsigned int data_ctx_attach_count; // private
    file_hash_t file_hash; // public
} file_ctx_t;

typedef file_ctx_t *file_ctx_ptr_t;

•  Field  filename  points  to  the  name  of  the  opened  file. 

• Field file_descriptor holds the descriptor of the opened file.

• Field file_size indicates the size of the file.

• Field total_bytes_read indicates the number of bytes read so far.

• Field eof indicates that there are no more bytes to read.

• Field data_ctx_attach_count tracks how many file reads are active.

• Field file_hash contains the SHA-256, MD5, and SHA-1 file hashes. File hash values
are transient or contain running hash values until 1) the associated data context state
reaches DATA_HASHED and 2) the associated data has reached EOF.
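
The two conditions under which a file hash stops being transient can be written as a small predicate. This is a sketch only: the function name file_hash_is_final() is hypothetical, and the enumeration simply lists the four data context states named earlier in the order they are given.

```c
#include <assert.h>

/* Data context states as named in the text (order assumed). */
typedef enum {
    DATA_EMPTY,
    DATA_READ,
    DATA_HASHING,
    DATA_HASHED
} data_state_t;

/* Returns 1 when the file hash may be treated as final: the data context
   has reached DATA_HASHED and the associated data has reached EOF. */
static int file_hash_is_final(data_state_t state, int eof)
{
    assert(eof == 0 || eof == 1);
    return state == DATA_HASHED && eof == 1;
}
```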

D NPS Cell SPU Hash Utilities Overview

These utilities provide optimized hashing algorithms written in assembly and run on SPUs.
SHA-256, MD5, and SHA-1 hashing algorithms are provided. Optimizations for maximum
total throughput, maximum single-file throughput, and Sector hashing are provided.

The NPS Cell SPE Hash Utilities are described in MAN pages available in our source tree at

src/hash_algorithms/cell_hash_man/spu_hash.

D.1 Maximum Total Throughput

These interfaces provide maximum total throughput by hashing four separate datasets in lanes.
These interfaces should be used when multiple files may be hashed concurrently.

• sha256_hash_4lanes

• md5_hash_4lanes

• sha1_hash_4lanes

D.2  Maximum  Single-file  Throughput,  Pipelined 

These interfaces provide maximum single-file throughput by hashing using two pipelined SPUs.
SPU pipeline 1 prepares a 16 KByte precomputation table from 4 KBytes of input, while SPU
pipeline 2 hashes the 16 KBytes of precomputation data. Pipelined processing is available for
SHA-256 and SHA-1 and is not available for MD5.

• sha256_hash_pipe1

• sha256_hash_pipe2

• sha1_hash_pipe1

• sha1_hash_pipe2
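
To give a feel for what a precomputation stage can look like, the sketch below expands SHA-256 input blocks into the standard 64-word message schedule defined by the SHA-256 specification. For SHA-256 this expansion turns each 64-byte block into 256 bytes, which matches the 4 KByte to 16 KByte ratio quoted above. Whether the table built by the pipe1 interfaces has exactly this content is an assumption on our part, and the function names are hypothetical; this is portable C, not the SPU assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SHA256_BLOCK_BYTES 64u   /* one SHA-256 input block */
#define SHA256_ROUNDS      64u   /* schedule words per block */

/* Rotate a 32-bit word right by n bits (0 < n < 32). */
static uint32_t rotr32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32u - n));
}

/* Expand one 64-byte block into the 64-word message schedule w. */
static void sha256_schedule(const uint8_t block[64], uint32_t w[64])
{
    assert(block != NULL && w != NULL);
    /* The first 16 words are the block itself, read big-endian. */
    for (size_t t = 0; t < 16; t++) {
        w[t] = ((uint32_t)block[4 * t] << 24)
             | ((uint32_t)block[4 * t + 1] << 16)
             | ((uint32_t)block[4 * t + 2] << 8)
             |  (uint32_t)block[4 * t + 3];
    }
    /* The remaining 48 words depend only on earlier schedule words,
       so they can be computed ahead of the compression rounds. */
    for (size_t t = 16; t < 64; t++) {
        uint32_t s0 = rotr32(w[t - 15], 7) ^ rotr32(w[t - 15], 18) ^ (w[t - 15] >> 3);
        uint32_t s1 = rotr32(w[t - 2], 17) ^ rotr32(w[t - 2], 19) ^ (w[t - 2] >> 10);
        w[t] = w[t - 16] + s0 + w[t - 7] + s1;
    }
}

/* Schedule bytes produced for a given number of input bytes (whole blocks). */
static size_t schedule_bytes(size_t input_bytes)
{
    return (input_bytes / SHA256_BLOCK_BYTES) * SHA256_ROUNDS * sizeof(uint32_t);
}
```

Because the schedule does not depend on the running hash state, one processing element can prepare it while another consumes the previous table, which is the property a two-stage pipeline relies on.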

D.3  Maximum  Single-file  Throughput,  Single-SPU 

These  interfaces  provide  maximum  single-file  throughput  by  hashing  a  single  stream  on  a  single 
SPU.  For  SHA-256  and  SHA-1  processing,  use  Maximum  Single-file  Throughput,  Pipelined  in¬ 
terfaces,  if  possible,  and  use  Maximum  Single-file  Throughput,  Single-SPU  interfaces  when  fi¬ 
nalizing  the  last  block.  For  MD5  processing,  use  the  Maximum  Single-file  Throughput,  Single- 
SPU  interface  to  hash  the  entire  single  file. 

•  sha256_hash 

•  md5_hash 

• sha1_hash

D.4  Sector  Hashing 

These  interfaces  provide  maximum  Sector  hashing  throughput  by  hashing  four  Sectors  concur¬ 
rently,  one  in  each  lane.  These  interfaces  combine  hash  initialization,  hashing,  and  finalization 
within  the  same  assembly  call  for  improved  performance. 

• sha256_hash_4sectors

• md5_hash_4sectors

• sha1_hash_4sectors


E Application hash_files for Maximum Throughput

We provide maximum hashing throughput via program hash_files.

E.1 hash_files User Interface

Usage:

hash_files -l <filename> -a <hash algorithm> [-s]
           [-r <num_4K_sectors>] [-n <number of SPUs>]

-l <filename> specifies the name of a file containing a list of filenames to hash, one
filename per line, no spaces, each line terminated by a carriage return. This option is
required.

-a <algorithm> identifies the hashing algorithm(s) to use. Options are sha256, md5,
sha1, and all for all three.

-s specifies that Sector hashing is performed in addition to file hashing.

-r <num_4K_sectors> identifies the number of sectors to read per file read. A value of
two may be optimal for many small files, and 16 may be optimal for mixed small and large
files. 1024 may be optimal for many large files. Large values may cause page caching.

-n <number of SPUs> identifies the number of SPUs to use. We recommend using all
SPUs or else enough SPUs to keep the hashing process I/O bound rather than processing
bound. I/O performance is based on the hardware and file systems providing the files.
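
As a rough illustration of how the -a and -r selections might be interpreted, consider the sketch below. The flag bit values and the function names parse_algorithm() and read_size_bytes() are hypothetical; the actual encoding used by hash_files is not shown in this document.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag bits; the real hash_flags encoding may differ. */
enum {
    HASH_SHA256 = 1 << 0,
    HASH_MD5    = 1 << 1,
    HASH_SHA1   = 1 << 2
};

/* Map the -a argument to flag bits; returns 0 for an unknown name. */
static unsigned parse_algorithm(const char *name)
{
    assert(name != NULL);
    if (strcmp(name, "sha256") == 0) return HASH_SHA256;
    if (strcmp(name, "md5") == 0)    return HASH_MD5;
    if (strcmp(name, "sha1") == 0)   return HASH_SHA1;
    if (strcmp(name, "all") == 0)    return HASH_SHA256 | HASH_MD5 | HASH_SHA1;
    return 0;
}

/* Bytes requested per file read for a given -r value (4 KByte sectors). */
static unsigned long read_size_bytes(unsigned long num_4k_sectors)
{
    return num_4k_sectors * 4096ul;
}
```

Under this reading, -r 1024 requests 4 MBytes per file read, which is consistent with the caution that large values may cause page caching.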


E.2  Output 

Output  is  sent  to  stdout  and  consists  of  multiple  file  and  Sector  hash  values  separated  by 
carriage  returns.  File  hash  output  is  as  follows: 

<filename>  <algorithm>  <hash  value> 

Sector  hash  output  is  as  follows: 

<filename>  Sector  <sector  number>  <algorithm>  <hash  value> 

Text <filename> is the path of the filename obtained from the file list file, <algorithm>
is one of SHA-256, MD5, or SHA-1, and <hash value> consists of sets of eight-digit
hexadecimal values separated by single space characters. For Sector hashes, the word Sector, a
space, and the integer Sector number starting at zero are also provided.

An example MD5 file and Sector hash for 14 KByte file myfile follows:

myfile Sector 0 MD5 14c43752 b777ad9c 107fc589 b6faa235
myfile Sector 1 MD5 1390b353 cba445e0 f033d799 626c975a
myfile Sector 2 MD5 456bed94 e3b5cb37 21c20109 a98e599a
myfile MD5 bbc725c7 eaabd331 252dde6b 5d2cc600
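
The hash value format can be reproduced with a few lines of C. The sketch below is illustrative only: format_hash() and complete_sectors() are hypothetical names, and the assumption that only complete 4 KByte Sectors receive a Sector hash is inferred from the example above, where a 14 KByte file yields Sectors 0 through 2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Write nwords 32-bit words as eight-digit hexadecimal groups separated by
   single spaces. Returns the number of characters written, excluding the
   terminating NUL, or -1 if the output buffer is too small. */
static int format_hash(char *out, size_t out_size,
                       const uint32_t *words, size_t nwords)
{
    size_t used = 0;

    assert(out != NULL && words != NULL);
    if (out_size == 0)
        return -1;
    out[0] = '\0';
    for (size_t i = 0; i < nwords; i++) {
        int n = snprintf(out + used, out_size - used, "%s%08x",
                         i ? " " : "", (unsigned int)words[i]);
        if (n < 0 || (size_t)n >= out_size - used)
            return -1;            /* output buffer too small */
        used += (size_t)n;
    }
    return (int)used;
}

/* Number of complete 4 KByte Sectors in a file of the given size. */
static unsigned long complete_sectors(unsigned long file_size)
{
    return file_size / 4096ul;
}
```

An MD5 value is four words, a SHA-1 value five, and a SHA-256 value eight, so the same routine covers all three output lines.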


E.3 Design of the hash_files Application

The hash_files application runs on top of NPS Cell PPU Hash Utilities. The NPS Cell PPU
Hash Utilities drive one or more SPUs which in turn make calls to Cell SPE Hash Utilities for
performing the actual hashing. The hash_files application is available in our source tree at

src/hash_algorithms/cell_hash.

E.3.1  Local  Data  Structures 

• lanes_spu_ctx_ptr[MAX_SPUS] contains the lane SPU context pointers for each
SPU performing lane processing. Lane SPU context pointers identify the SPU to use
when hashing on SPUs in lanes.

• data_ctx_ptr[MAX_SPUS][2][4] contains a matrix of data context pointers. A set
of pointers is defined for each SPU. On each SPU, two sets are provided, allowing
double-buffering of file reads. Four pointers per SPU set are provided to support
processing in four lanes.

• file_ctx_ptr_queue[MAX_SPUS][4] contains a matrix of file pointers, one per
lane per SPU.

When new files are read, they are bound to a position in the file_ctx_ptr_queue and
in the data_context_ptr. When a file is read and it reaches EOF, its file context is
removed from file_ctx_ptr_queue so that a new file context may be placed there,
but the file context is not deallocated. File contexts are deallocated by local helper
function clear_hash_state after hashing for the file has fully completed. Managing file
contexts this way allows lanes to have data from a new file on the next cycle after the
previous file, allowing a new file to be loaded and read before the hash for the previous
file is calculated and reported.

E.3.2  Local  Helper  Functions 

These functions assist the hash_files program by wrapping functions into conceptual tasks
in groups of four for lane processing.

• start_hashing() schedules the hashing of four data contexts on an SPU based on
the SPU and dataset index values.

• report_hashes() prints calculated Sector hashes for the four datasets identified by
the given SPU and dataset index. Where file contexts indicate EOF, calculated file hashes
are also printed.

• clear_hash_state() calls data_clear_hash_state() to clear the hash state
for each of the four datasets identified by the given SPU and dataset index. When datasets
are at EOF and hashes have been consumed, their associated files are closed.

• read_files() calls function file_read_sectors() to read four files into data
contexts identified by the given SPU and dataset index. If a file is at EOF, it is removed
from file_ctx_ptr_queue and a new file is loaded and read in its place.


E.3.3 hash_files Main Entry

The main entry parses input to identify the following runtime parameters:

• num_spus identifies the number of SPUs to parallel-process on. One SPU context is
created for each SPU.

• hash_flags identifies the hash algorithms to be performed on the files, up to six
algorithms: file and Sector hashes for SHA-256, MD5, and SHA-1 algorithms.

• list_filename identifies the file containing the list of filenames to be hashed.

• sectors_per_read identifies the maximum number of sectors to read in a file read
operation.

Once these runtime parameters are set, the main program calls function process_files_lanes()
which allocates processing resources, iteratively reads and hashes files using lane processing,
then deallocates resources and exits.


E.3.4 Wrapper Function process_files_lanes()

This function opens SPU and data resources for hashing in lanes, iteratively hashes files using
double-buffering, closes SPU and data resources, then exits. Flow is as follows:

 1: create n SPU contexts using lanes_open_spu(), create a matrix of data contexts using data_alloc(),
    and open the list of filenames to be hashed using local helper function open_list()
 2: loop
 3:   for each SPU n do
 4:     poll SPU availability using function lanes_poll()
 5:     if poll status indicates that the SPU is ready then
 6:       start hashing the next set of lanes on the SPU using local helper function start_hashing()
 7:       report hashes from the current set of lanes using local helper function report_hashes()
 8:       clear hash states from the current set of lanes using local helper function clear_hash_state()
 9:       read files into the current set of available lanes using local helper function read_files()
10:       swap the current index and next index values for SPU n for the next loop through
11:     end if
12:   end for
13:   exit loop if the filename list is exhausted and all data context states are at DATA_EMPTY
14: end loop
15: close the list of consumed filenames using local helper function close_list(), close all data
    contexts using data_free(), and close the n SPU contexts using lanes_close_spu()

Note that the first iteration of the loop hashes and reports empty data. This is acceptable because
hashing and printing empty data does nothing. Also note that file reads are double-buffered and
hashing is started immediately after polling indicates that the SPU is ready. This design strategy
minimizes the delay between when an SPU is ready and when the SPU is tasked with work,
improving performance.
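
The ordering of start, report, clear, read, and swap can be exercised with a small host-side model. The sketch below is a simplification under stated assumptions: it models one SPU and one lane, every file fits in a single read, and the SPU is assumed to finish its job before the next poll. The type sim_t and the functions poll_spu() and run_loop() are hypothetical and stand in for the real PPU utilities.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { DATA_EMPTY, DATA_READ, DATA_HASHING, DATA_HASHED } data_state_t;

/* Hypothetical single-SPU, single-lane model of the processing loop. */
typedef struct sim_t {
    data_state_t state[2];   /* the two double-buffered data contexts */
    size_t files_left;       /* filenames not yet opened */
    size_t reported;         /* file hashes reported so far */
    size_t iterations;       /* passes through the loop */
} sim_t;

/* Poll: in this model the SPU always finishes before the next poll. */
static void poll_spu(sim_t *s)
{
    for (int i = 0; i < 2; i++)
        if (s->state[i] == DATA_HASHING)
            s->state[i] = DATA_HASHED;
}

static void run_loop(sim_t *s)
{
    int current = 0, next = 1;

    assert(s != NULL);
    for (;;) {
        s->iterations++;
        poll_spu(s);
        /* start hashing the buffer that was read on the previous pass */
        if (s->state[next] == DATA_READ)
            s->state[next] = DATA_HASHING;
        /* report, then clear, the buffer hashed on the previous pass */
        if (s->state[current] == DATA_HASHED) {
            s->reported++;
            s->state[current] = DATA_EMPTY;
        }
        /* read the next file into the freed buffer */
        if (s->state[current] == DATA_EMPTY && s->files_left > 0) {
            s->files_left--;
            s->state[current] = DATA_READ;
        }
        /* swap the current and next index values */
        int tmp = current;
        current = next;
        next = tmp;
        if (s->files_left == 0 &&
            s->state[0] == DATA_EMPTY && s->state[1] == DATA_EMPTY)
            break;
    }
}
```

In this model, the file read for one buffer always overlaps the hashing of the other, which is the point of tasking the SPU before reporting and reading.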

E.3.5  PPU  Call  Tree 

The  PPU  call  tree  is  shown  in  Figure  9. 


hash_files *1
    process_files_lanes *2
        start_hashing *2
            lanes_hash *4
        report_hashes *2
            print_hashes *3
        clear_hash_state *2
            data_clear_hash_state *5
            file_close_from_list *2
                file_close *5
        read_files *2
            file_open_from_list *2
                file_open *5
            file_read_sectors *5

*1 Application: hash_files.c
*2 Helper function: process_files_lanes.c
*3 Helper function: print_hashes.c
*4 PPU Utilities: lanes_hash.c
*5 PPU Utilities: file_context.c

Figure 9: PPU call tree.


E.3.6  Example  Flow 

The  following  example  steps  illustrate  state  transition  flow  as  two  files  are  loaded  and  hashed 
on  one  SPU.  For  simplification,  only  one  SPU  and  one  lane  are  shown. 

1.  Start:  No  files  have  been  read,  data  contexts  are  in  their  initial  condition,  and  the  SPU  is 
ready.  See  Figure  10. 


File List Pointer: Next -> file1, file2, EOF
Active File Contexts: None active
SPU Context: state = SPU_AVAILABLE
Mapped File Context Pointer: NULL
Data Context[0]: state = DATA_EMPTY, num_current_bytes = 0, file_ctx_ptr = NULL
Data Context[1]: state = DATA_EMPTY, num_current_bytes = 0, file_ctx_ptr = NULL

Figure 10: Step 1: Resources in their initial starting state.

2. Poll until SPU done: The SPU has not been tasked, so this returns immediately.

3. Start hashing data_ctx[1]: The data context is empty, so the SPU is not tasked.

4. Report hashes for data_ctx[0]: The data context is empty, so nothing is reported.

5. Clear hashes for data_ctx[0]: The data context is empty, so no action is taken.

6. Read file file1 to data_ctx[0]:

(a) The file context pointer mapped to this lane is NULL, so get the next filename
file1 from the file list pointer, open an active file context for it, and map it to
the data context.

(b) Read into data_ctx[0] from the newly mapped file1. Because the file is small,
the file context reaches EOF on this first read and the data context is also marked
EOF.

See Figure 11.

File List Pointer: file1, Next -> file2, EOF
Active File Contexts: "file1", EOF=1
SPU Context: state = SPU_AVAILABLE
Data Context[0]: state = DATA_READ, num_current_bytes = 500, file_ctx_ptr = ->file1, EOF=1
Data Context[1]: state = DATA_EMPTY, num_current_bytes = 0, file_ctx_ptr = NULL

Figure 11: Step 6: Resources after reading file file1.

7. Continue the processing loop because the file list has not reached EOF yet.

8. Poll until SPU done: The SPU has still not been tasked, so this returns immediately.

9. Start hashing data_ctx[0]:

(a) The SPU state transitions to SPU_NOT_AVAILABLE.

(b) Data context state for data_ctx[0] transitions to DATA_HASHING.

(c) The PPU signals the SPU to start hashing.

See Figure 12.

Active File Contexts: "file1", EOF=1
SPU Context: state = SPU_NOT_AVAILABLE
Data Context[0]: state = DATA_HASHING, num_current_bytes = 500, file_ctx_ptr = ->file1, EOF=1
Data Context[1]: state = DATA_EMPTY, num_current_bytes = 0, file_ctx_ptr = NULL

Figure 12: Step 9: Resources after starting to hash data_ctx[0].

10. Report hashes for data_ctx[1]: Data contexts are still empty, so nothing is reported.

11. Clear hashes for data_ctx[1]: Data contexts are already empty, so no action is taken.

12. Read file file2 to data_ctx[1]:

(a) The file context pointer mapped to this lane is NULL, so get the next filename
file2 from the file list pointer, open an active file context for it, and map it to
the data context.

(b) Read into data_ctx[1] from mapped file2. Because the file is small, the file
context reaches EOF on this first read and the data context is also marked EOF.

See Figure 13.

SPU Context: state = SPU_NOT_AVAILABLE
Mapped File Context Pointer: ->file2
Data Context[0]: file_ctx_ptr = ->file1, EOF=1
Data Context[1]: state = DATA_READ, num_current_bytes = 700, file_ctx_ptr = ->file2, EOF=1

Figure 13: Step 12: Resources after reading file file2.


13. Continue the processing loop because the file list has not reached EOF yet.

14. Poll until SPU done to make sure the SPU has completed hashing data_ctx[0]. When
done:

(a) The SPU state has transitioned to SPU_AVAILABLE.

(b) Data context data_ctx[0] is transitioned to DATA_HASHED.

15. Start hashing data_ctx[1]:

(a) The SPU state transitions to SPU_NOT_AVAILABLE.

(b) Data context state for data_ctx[1] transitions to DATA_HASHING.

(c) The PPU signals the SPU to start hashing.

See Figure 14.

File List Pointer: file1, file2, Next -> EOF
Active File Contexts: "file1", EOF=1; "file2", EOF=1
SPU Context: state = SPU_NOT_AVAILABLE
Mapped File Context Pointer: ->file2
Data Context[0]: state = DATA_HASHED, num_current_bytes = 500, file_ctx_ptr = ->file1, EOF=1
Data Context[1]: state = DATA_HASHING, num_current_bytes = 700, file_ctx_ptr = ->file2, EOF=1

Figure 14: Step 15: Resources after starting to hash data_ctx[1].

16. Report hashes for data_ctx[0]:

(a) Report file hashes because data_ctx[0] indicates EOF.

(b) Do not report Sector hashes. The file size is less than 4 KBytes so no Sector hashes
are computed.

17. Clear hashes for data_ctx[0]:

(a) data_ctx[0] indicates EOF so close the active file context for file1 indicated
by data_ctx[0].

(b) The data_ctx[0] state transitions to DATA_EMPTY.

See Figure 15.

Active File Contexts: "file2", EOF=1
SPU Context: state = SPU_NOT_AVAILABLE
Mapped File Context Pointer: ->file2
Data Context[0]: state = DATA_EMPTY, num_current_bytes = 0, file_ctx_ptr = NULL
Data Context[1]: state = DATA_HASHING, num_current_bytes = 700, file_ctx_ptr = ->file2, EOF=1

Figure 15: Step 17: Resources after clearing data_ctx[0].


18. Read nothing because the read returned EOF.

19. Continue the processing loop because although the file list is at EOF, data_ctx[1]
state has not transitioned back to DATA_EMPTY.

20. Poll until SPU done to make sure the SPU has completed hashing data_ctx[1]. When
done:

(a) The SPU state transitions to SPU_AVAILABLE.

(b) Data context data_ctx[1] transitions to DATA_HASHED.

See Figure 16.

Active File Contexts: "file2", EOF=1
SPU Context: state = SPU_AVAILABLE
Mapped File Context Pointer: ->file2
Data Context[0]: state = DATA_EMPTY, num_current_bytes = 0, file_ctx_ptr = NULL
Data Context[1]: state = DATA_HASHED, num_current_bytes = 700, file_ctx_ptr = ->file2, EOF=1

Figure 16: Step 20: Resources after waiting for hash data_ctx[1] to complete.


21. Start hashing data_ctx[0]:

(a) The SPU state transitions to SPU_NOT_AVAILABLE.

(b) Data context state for data_ctx[0] transitions to DATA_HASHING.

(c) The PPU signals the SPU to start hashing.

See Figure 17.

Active File Contexts: "file2", EOF=1
SPU Context: state = SPU_NOT_AVAILABLE
Mapped File Context Pointer: ->file2
Data Context[0]: state = DATA_EMPTY, num_current_bytes = 0, file_ctx_ptr = NULL
Data Context[1]: state = DATA_HASHED, num_current_bytes = 700, file_ctx_ptr = ->file2, EOF=1

Figure 17: Step 21: Resources after starting to hash data_ctx[0].


22. Report calculated hashes for data_ctx[1]:

(a) Report file hashes because data_ctx[1] indicates EOF.

(b) Do not report Sector hashes. The file size is less than 4 KBytes so no Sector hashes
are computed.

23. Clear hashes for data_ctx[1]:

(a) data_ctx[1] indicates EOF so close the active file context for file2 indicated
by data_ctx[1].

(b) Transition the data_ctx[1] state to DATA_EMPTY.

See Figure 18.

File List Pointer: file1, file2, Next -> EOF
Active File Contexts: None active
SPU Context: state = SPU_NOT_AVAILABLE
Mapped File Context Pointer: NULL
Data Context[0]: state = DATA_EMPTY, num_current_bytes = 0, file_ctx_ptr = NULL
Data Context[1]: state = DATA_EMPTY, num_current_bytes = 0, file_ctx_ptr = NULL

Figure 18: Step 23: Resources after clearing data_ctx[1].

24. Read nothing (again) because the file list pointer is at EOF.

25. Terminate the processing loop because the file list is at EOF and all data_ctx[*] states
have transitioned back to DATA_EMPTY.

E.4 Design of SPU Program lanes_spu

The lanes_spu program initializes, then loops receiving and processing hashing jobs until
the SPU_DONE signal is received.

The main SPU program flow is as follows:

 1: obtain PPU shared data space ea_ppu_shared_data from parameter input
 2: obtain the SPU index used for diagnostics from parameter input
 3: initialize incoming and outgoing signal values to inactive
 4: mbox mail the SPU shared data space ls_spu_shared_data to the PPU
 5: loop
 6:   continue loop until signal changes from SPU_NO_SIGNAL
 7:   take the tick_start timestamp if timestamp is compiled in via the __TIMESTAMP__ compiler flag
 8:   quit if signal is SPU_DONE
 9:   run the hash job by calling function spu_process_lanes()
10:   take the tick_finish timestamp if timestamp is compiled in via the __TIMESTAMP__ compiler flag
11:   signal SPU_AVAILABLE to the PPU to indicate that the SPU is done and is ready for a new job
12: end loop

The spu_process_lanes program flow processes individual hashing assignments by reading
four data contexts and performing hashing on them as follows:

 1: start and finish DMA get set of four data context pointers
 2: start and finish DMA get four data contexts
 3: start and finish DMA get three starting file hash values
 4: set sector_index to 0 for looping processing one Sector at a time
 5: start DMA get current (first) data buffer
 6: while more data do
 7:   start DMA get next data buffer using DMA tag for next index
 8:   wait for DMA get of current DMA tag to complete (this includes the data buffer and sector hashes)
 9:   perform a file hash of the current data buffer, finalizing any file hashes that are at EOF
10:   perform a sector hash of the current data buffer
11:   start DMA put of current calculated sector hashes using DMA tag for current index
12:   swap current and next index values
13: end while
14: start and finish DMA put three calculated file hash values
15: wait for DMA put of final sector hash using DMA tag for the last index, which is now next
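
The double-buffered get loop in steps 5 through 13 can be modeled on a host machine as shown below. This is a sketch, not SPU code: memcpy() stands in for an asynchronous DMA get, a byte sum stands in for the hash, and the names start_get(), consume(), and process_sectors() are hypothetical. On a real SPU the get would be issued with a DMA tag per buffer and the loop would wait on the tag for the current buffer before using it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 4096u

/* Stand-in for starting a DMA get of one Sector into a local buffer. */
static void start_get(uint8_t *local, const uint8_t *main_mem, size_t sector)
{
    memcpy(local, main_mem + sector * SECTOR_SIZE, SECTOR_SIZE);
}

/* Stand-in for hashing one Sector: a simple byte sum, not a real hash. */
static uint32_t consume(const uint8_t *buf)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < SECTOR_SIZE; i++)
        sum += buf[i];
    return sum;
}

/* Process num_sectors Sectors using two alternating local buffers. */
static uint32_t process_sectors(const uint8_t *main_mem, size_t num_sectors)
{
    static uint8_t local[2][SECTOR_SIZE];
    size_t current = 0, next = 1;
    uint32_t total = 0;

    assert(main_mem != NULL);
    if (num_sectors == 0)
        return 0;
    start_get(local[current], main_mem, 0);          /* prime the first buffer */
    for (size_t sector = 0; sector < num_sectors; sector++) {
        if (sector + 1 < num_sectors)
            start_get(local[next], main_mem, sector + 1);
        /* a real implementation waits here on the DMA tag for current */
        total += consume(local[current]);
        size_t tmp = current;                        /* swap current and next */
        current = next;
        next = tmp;
    }
    return total;
}
```

The design choice illustrated is that the transfer for the next buffer is always started before the current buffer is processed, so transfer latency can hide behind computation.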

E.4.1  SPU  Call  Tree 

The  SPU  call  tree  is  shown  in  Figure  19. 

lanes_spu *1
    spu_process_lanes *2
        start_get *2
        wait_tag *2
        do_hash *2
        get_file_hashes *2
        put_file_hashes *2
        put_sha256_sector_hashes *2
        put_md5_sector_hashes *2
        put_sha1_sector_hashes *2
            finalize_sha256_data *2
            finalize_md5_data *2
            finalize_sha1_data *2
            sha256_hash_4lanes *3
            sha256_hash *3
            sha256_hash_4sectors *3
            md5_hash_4lanes *3
            md5_hash *3
            md5_hash_4sectors *3
            sha1_hash_4lanes *3
            sha1_hash *3
            sha1_hash_4sectors *3

*1 SPU Application: lanes_spu.c
*2 Helper function: spu_process_lanes.c
*3 SPU Utilities

Figure 19: SPU call tree.

F Application hash_single_file for Maximum Single-file Throughput

Maximum single-file throughput is achieved using application hash_single_file.

F.1 hash_single_file User Interface

Usage:

hash_single_file -f <filename> -a <hash algorithm>
                 [-r <num_4K_sectors>] [-c] [-v]

-f <filename> specifies the name of the file to be hashed. This option is required.

-a <algorithm> identifies the hashing algorithm to use, and is one of sha256, md5, or
sha1.

-r <num_4K_sectors> identifies the number of sectors to read per file read. The value
should be large enough to reduce iterative reading, but small enough to fit read data in
system memory without causing page caching. Default is 1024.

-c displays SPU resource information.

-v displays input parameters.

F.2 Output

Output is sent to stdout and consists of one file hash value as follows:

<filename> <algorithm> <hash value>

Text <filename> is the path of the filename hashed, <algorithm> is one of SHA-256,
MD5, or SHA-1, and <hash value> consists of sets of eight-digit hexadecimal values
separated by single space characters.

An example MD5 file hash for file myfile follows:

myfile MD5 bbc725c7 eaabd331 252dde6b 5d2cc600

F.3 Design of the hash_single_file Application

The hash_single_file application runs on top of NPS Cell PPU Hash Utilities. The NPS
Cell PPU Hash Utilities drive either a single SPU for hashing MD5 or a pair of SPUs for
hashing SHA-256 or SHA-1. The SPUs make calls to Cell SPE Hash Utilities for performing
the actual hashing. The hash_single_file application is available in our source tree at

src/hash_algorithms/cell_hash.

The hash_single_file application uses file process_pipe.c to hash SHA-256 or SHA-1
and uses file process_md5.c to hash MD5. Both files are identical except that process_pipe.c
makes calls to Pipe Hash interfaces and process_md5.c makes calls to Single MD5 Hash
interfaces.

F.3.1  Local  Data  Structures 

• pipe_spu_ctx_ptr (or md5_spu_ctx_ptr) contains the SPU context pointer for the
SPUs (or SPU) performing the hash processing.

• data_ctx_ptr[2] contains two data context pointers for supporting hashing using
double-buffering.

• file_ctx_ptr points to the file context that will be active through the duration of
hashing.

F.3.2  Local  Helper  Functions 

These functions assist the hash_single_file program by wrapping functions into conceptual
tasks for double-buffered hash processing.

• start_hashing() schedules the hashing of the data context.

• report_hash() resets the data context for the next read. Where file contexts indicate
EOF, the calculated file hash is also printed.

• clear_hash_state() calls data_clear_hash_state() to clear the hash state
for the dataset. When clear_hash_state() is called and the associated data context
is at EOF, the associated file is closed.

• read_file() reads the file into space in the data context. If the file is at EOF, it is
removed from file_ctx_ptr.

• data_done() indicates when data processing has completed by identifying when both
data contexts have returned to the DATA_EMPTY state.

F.3.3 hash_single_file Main Entry

The main entry parses input to identify the following runtime parameters:

• hash_flags identifies the file hash algorithm to be performed on the file, and is one of
SHA-256, MD5, or SHA-1.

• filename identifies the file to be hashed.

• sectors_per_read identifies the maximum number of sectors to read in a file read
operation.

Once these runtime parameters are set, the main program calls function process_pipe()
(or process_md5()) which allocates processing resources, iteratively reads and hashes the
specified file, then deallocates resources and exits.

F.3.4 Wrapper Function process_pipe() (or process_md5())

This function opens SPU and data resources for hashing, iteratively hashes the file using
double-buffering, closes SPU and data resources, then exits. Flow is as follows:

 1: create an SPU context using pipe_open_spu() (or md5_open_spu()), create the two data
    contexts using data_alloc(), and open the file to be hashed, assigning it to file_ctx_ptr
 2: loop
 3:   poll SPU availability using function pipe_poll() or md5_poll()
 4:   if poll status indicates that the SPU is ready then
 5:     start hashing the next file read on the SPU using local helper function start_hashing()
 6:     report, if done, the hash from the current read using local helper function report_hash()
 7:     clear the current hash state and, if done, close the file, using local helper function clear_hash_state()
 8:     read from the file into space in the current data context using local helper function read_file()
 9:     swap the current index and next index values for the next loop through
10:   end if
11:   exit loop if both data context states are at DATA_EMPTY
12: end loop
13: close both data contexts using data_free() and close the SPU context using pipe_close_spu()
    (or md5_close_spu())

Note that the first iteration of the loop hashes and reports empty data. This is acceptable because
hashing and printing empty data does nothing. Also note that file reads are double-buffered and
hashing is started immediately after polling indicates that the SPU is ready. This design strategy
minimizes the delay between when an SPU is ready and when the SPU is tasked with work,
improving performance.

F.4  Design  of  SPU  Program  pipel.spu 

After  the  pipel.spu  Program  initializes,  it  loops  receiving  and  processing  hashing  jobs  until 
the  SPU -DONE  signal  is  received. 

The  main  outer  SPU  flow  is  identical  to  that  of  the  lanes.spu  program  except  that  it  addi¬ 
tionally  receives  the  EA  of  the  pipe2  processing  space  during  initialization  so  that  the  two  pipes 
can  interact. 

The  inner  flow  is  managed  by  function  spu_process_pipel  which  processes  individual 
hashing  assignments  by  reading  the  data  context  provided  by  the  PPU  and  creating  preprocessed 
schedule  data  from  data  identified  in  the  data  context  as  follows: 

 1: start and finish DMA get of the data context
 2: done if there are 0 bytes to process
 3: set sector_index to 0 for looping, processing one Sector at a time
 4: start DMA get of current (first) data buffer using DMA tag for current sector index
 5: loop
 6:   if more data is available on the next sector_index then
 7:     start DMA get of next data buffer using DMA tag for next index
 8:   end if
 9:   wait for DMA get of current DMA tag to complete
10:   wait for pipe2 to indicate that the current schedule data buffer may be changed
11:   calculate current schedule data from current data
12:   signal pipe2 that the current schedule data buffer is ready
13:   if last iteration then
14:     if data context indicates EOF then
15:       wait for pipe2 to indicate that the current + 1 schedule data buffer may be changed
16:       finalize current data into the current + 1 schedule data buffer
17:       signal pipe2 that the current + 1 schedule data buffer has finalization data
18:     end if
19:     break from loop
20:   end if
21:   increment current by 1
22: end loop
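
The buffer handshake in steps 10 and 12 can be illustrated with a small simulation. This is only a sketch under stated assumptions: Python threads stand in for the two SPUs, semaphores stand in for the inter-SPU signals, two schedule buffers are assumed, and make_schedule is a hypothetical placeholder for the real schedule calculation. None of these names come from the original programs.

```python
import threading

NUM_BUFFERS = 2  # assumption: two schedule buffers used alternately


def make_schedule(sector):
    """Hypothetical placeholder for the schedule calculation done by pipe1."""
    return sector[::-1]


def run_pipeline(sectors):
    """Simulate pipe1 (producer) and pipe2 (consumer) sharing the buffers."""
    buffers = [None] * NUM_BUFFERS
    # may_change[i]: pipe2 tells pipe1 that buffer i may be overwritten.
    may_change = [threading.Semaphore(1) for _ in range(NUM_BUFFERS)]
    # ready[i]: pipe1 tells pipe2 that buffer i holds fresh schedule data.
    ready = [threading.Semaphore(0) for _ in range(NUM_BUFFERS)]
    consumed = []

    def pipe1():
        for current, sector in enumerate(sectors):
            slot = current % NUM_BUFFERS
            may_change[slot].acquire()              # wait until the buffer may be changed
            buffers[slot] = make_schedule(sector)   # calculate current schedule data
            ready[slot].release()                   # signal that the buffer is ready

    def pipe2():
        for current in range(len(sectors)):
            slot = current % NUM_BUFFERS
            ready[slot].acquire()                   # wait for pipe1's ready signal
            local_copy = buffers[slot]              # stand-in for the DMA get from pipe1
            may_change[slot].release()              # let pipe1 reuse the buffer
            consumed.append(local_copy)             # stand-in for hashing the schedule

    workers = [threading.Thread(target=pipe1), threading.Thread(target=pipe2)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return consumed
```

Because pipe2 releases a buffer as soon as it has copied it, pipe1 can prepare the schedule for a later sector while pipe2 is still hashing the current one, which is the overlap the two-pipe design is after.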

F.5 Design of SPU Program pipe2_spu

After the pipe2_spu program initializes, it loops pulling and hashing schedule data until the
SPU_DONE signal is received.

The main outer SPU flow is identical to that of the lanes_spu program except that it additionally
receives the EA of the pipe1 processing space during initialization so that the two pipes
can interact.

The inner flow is managed by function spu_process_pipe2, which processes individual
hashing assignments by reading the data context provided by the PPU and performing hashing
based on information from the data context and on schedule data prepared by pipe1 as
follows:

 1: start and finish DMA get of the data context
 2: done if there are 0 bytes to process
 3: start and finish DMA get of the starting hash value
 4: set sector_index to 0 for looping, processing one Sector at a time
 5: wait for pipe1 to signal that schedule data at the current sector index is available
 6: start DMA get of current (first) schedule data buffer from pipe1 using DMA tag for current index
 7: loop
 8:   if more data is available on the next sector_index then
 9:     wait for pipe1 to indicate that next schedule data is available
10:     start DMA get of next schedule data using DMA tag for next index
11:   end if
12:   wait for DMA get of current DMA tag to complete
13:   signal pipe1 that the current schedule data buffer may be changed
14:   calculate the running hash from the current schedule data
15:   if last iteration then
16:     if data context indicates EOF then
17:       wait for pipe1 to indicate that the schedule data buffer at index current + 1 has finalization data
18:       start and finish DMA get of the schedule data from pipe1 current + 1 index to pipe2 current + 1 index
19:       finalize the calculated hash by hashing the finalization schedule data at the current + 1 index
20:     end if
21:     break from loop
22:   end if
23:   increment current by 1
24: end loop
25: start and finish DMA put of the calculated hash value
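
The division of labor between the two pipes can be sketched by splitting a hash into a schedule stage and a round stage. The example below assumes SHA-1, chosen only because its constants are short; this section does not state which hash the pipes compute, and the same split applies to SHA-256. The function names are illustrative and do not come from the original programs, and the two stages run sequentially here rather than on separate SPUs.

```python
import hashlib
import struct

MASK = 0xFFFFFFFF
H0 = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0)


def rotl(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK


def make_schedule(block):
    """Stage 1 (pipe1's role): expand one 64-byte block into 80 schedule words."""
    w = list(struct.unpack(">16I", block))
    for t in range(16, 80):
        w.append(rotl(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1))
    return w


def hash_schedule(state, w):
    """Stage 2 (pipe2's role): run the 80 SHA-1 rounds over a prepared schedule."""
    a, b, c, d, e = state
    for t in range(80):
        if t < 20:
            f, k = (b & c) | (~b & d), 0x5A827999
        elif t < 40:
            f, k = b ^ c ^ d, 0x6ED9EBA1
        elif t < 60:
            f, k = (b & c) | (b & d) | (c & d), 0x8F1BBCDC
        else:
            f, k = b ^ c ^ d, 0xCA62C1D6
        temp = (rotl(a, 5) + f + e + k + w[t]) & MASK
        a, b, c, d, e = temp, a, rotl(b, 30), c, d
    return tuple((s + v) & MASK for s, v in zip(state, (a, b, c, d, e)))


def finalization_blocks(tail, total_len):
    """Apply SHA-1 padding to the trailing bytes; yields one or two blocks."""
    padded = tail + b"\x80"
    padded += b"\x00" * ((56 - len(padded)) % 64)
    padded += struct.pack(">Q", total_len * 8)
    return [padded[i:i + 64] for i in range(0, len(padded), 64)]


def pipelined_sha1(data):
    """Hash data by feeding each block through stage 1 and then stage 2."""
    state = H0
    full = len(data) - len(data) % 64
    blocks = [data[i:i + 64] for i in range(0, full, 64)]
    blocks += finalization_blocks(data[full:], len(data))
    for schedule in map(make_schedule, blocks):
        state = hash_schedule(state, schedule)
    return struct.pack(">5I", *state).hex()
```

The result of pipelined_sha1 can be compared against hashlib.sha1(data).hexdigest() as a check. Stage 1 depends only on the input block, not on the running hash, which is why it can be computed ahead of time on a different processing element.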

F.6 Design of SPU Program md5_spu

After the md5_spu program initializes, it loops receiving and processing hashing jobs until the
SPU_DONE signal is received.

The main outer SPU flow is identical to that of the lanes_spu program.

The inner flow is managed by function spu_process_md5, which processes individual hashing
assignments by reading the data context provided by the PPU and then hashing the data
identified in the data context as follows:

 1: start and finish DMA get of the data context
 2: done if there are 0 bytes to process
 3: start and finish DMA get of the starting hash value
 4: set sector_index to 0 for looping, processing one Sector at a time
 5: start DMA get of current (first) data buffer using DMA tag for current sector index
 6: loop
 7:   if more data is available on the next sector_index then
 8:     start DMA get of next data buffer using DMA tag for next index
 9:   end if
10:   wait for DMA get of current DMA tag to complete
11:   hash the current data
12:   if last iteration then
13:     if data context indicates EOF then
14:       finalize current data
15:     end if
16:     break from loop
17:   end if
18:   increment current by 1
19: end loop
20: start and finish DMA put of the calculated hash value
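
The double-buffered sector loop above can be sketched as follows. This is a minimal illustration rather than the SPU code: a file-like stream stands in for main memory, start_get is a hypothetical stand-in for an asynchronous DMA get, SECTOR_SIZE is an assumed value, and Python's hashlib.md5 object plays the role of the running hash value carried between jobs.

```python
import hashlib
import io

SECTOR_SIZE = 4096  # assumption: this section does not state the sector size


def start_get(stream, size):
    """Stand-in for starting a DMA get; the returned callable acts as the tag wait."""
    data = stream.read(size)  # a real DMA transfer would complete asynchronously
    return lambda: data


def process_md5(stream, total_bytes, hasher, at_eof):
    """Hash total_bytes from stream into hasher, one sector at a time.

    Returns the hex digest when at_eof is true, otherwise None.
    """
    if total_bytes == 0:
        return None  # nothing to process
    sizes = [min(SECTOR_SIZE, total_bytes - offset)
             for offset in range(0, total_bytes, SECTOR_SIZE)]
    pending = [None, None]  # two local buffers, indexed by sector parity
    pending[0] = start_get(stream, sizes[0])
    for current in range(len(sizes)):
        nxt = current + 1
        if nxt < len(sizes):
            # Start fetching the next sector before hashing the current one.
            pending[nxt % 2] = start_get(stream, sizes[nxt])
        data = pending[current % 2]()  # wait for the current transfer
        hasher.update(data)            # hash the current data
    return hasher.hexdigest() if at_eof else None
```

A stream can be split across several jobs without changing the digest, because the hasher object carries the running state from one call to the next; only the job flagged as EOF produces the final value.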




References 

[1] Raghav Bhaskar, Pradeep K. Dubey, Vijay Kumar, Atri Rudra, and Animesh Sharma. Efficient Galois field arithmetic on
SIMD architectures. June 7-9 2003.

[2] Thomas Chen, Ram Raghaven, Jason Dale, and Eiji Iwata. Cell Broadband Engine Architecture and its first implementation.
developerWorks, November 29 2005. http://www-128.ibm.com/developerworks/power/library/pa-cellperf/.

[3] Neil Costigan and Michael Scott. Accelerating SSL using the vector processors in IBM's Cell Broadband Engine for
Sony's Playstation 3. Cryptology ePrint Archive, Report 2007/061, 2007. http://eprint.iacr.org/.

[4] Owen Harrison and John Waldron. AES encryption implementation and analysis on commodity graphics processing units,
pages 209-226, September 13 2007. LNCS Volume 4727/2007.

[5] NIST. Secure hash standard (SHS). Technical Report FIPS PUB 180-3, National Institute of Standards and Technology
(NIST), October 2008.

[6] R. Rivest. The MD5 message-digest algorithm. Technical Report RFC 1321, MIT Laboratory for Computer Science and
RSA Data Security, April 1992.

[7] Jon Stokes. End of the line for IBM's Cell, November 23 2009.
http://arstechnica.com/hardware/news/2009/11/end-of-the-line-for-ibms-cell.ars.

[8] Jason Yang and James Goodman. Symmetric key cryptography on modern graphics hardware, pages 249-264, 2007.
LNCS 4833.




INITIAL  DISTRIBUTION  LIST 


1. Defense Technical Information Center
Ft. Belvoir, Virginia

2. Dudley Knox Library
Naval Postgraduate School
Monterey, California

3. Research and Sponsored Programs Office, Code 41
Naval Postgraduate School
Monterey, California


